There is a cliché in media stories where figures for a disease or condition are quoted, followed by a statement that “the true figures may be higher”. Sampling error means that initial figures are as likely to be under-estimates as over-estimates, but we only ever seem to be told that the condition is under-detected.

For example, this is from a recent (actually pretty good) *New Scientist* article about gender identity disorder (GID) in children, a condition where children who are biologically male feel female and vice versa:

> It is unclear how common GID is among children, but many transsexual adults say they felt they were “in the wrong body” from an early age. The incidence of adult transsexualism has been estimated at about 1 in 12,000 for male-to-females, and around 1 in 30,000 for female-to-males, although transsexual lobby groups say the true figures may be far higher.

These estimates are usually drawn from prevalence studies where maybe a few hundred or a few thousand people are tested. The researchers extrapolate from the number of cases to estimate how many people in the population as a whole will have the condition.

These estimates are made with statistical tests that give a margin of error, typically expressed as a confidence interval: the real figure is likely to fall within a range that includes values both higher and lower than the quoted amount.
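To make the margin of error concrete, here is a minimal sketch of one common way to put a 95% confidence interval around a prevalence estimate, the Wilson score interval. The choice of method and the numbers (3 cases in a sample of 2,000) are assumptions for illustration only; they are not taken from any study mentioned here.

```python
import math


def wilson_interval(cases: int, sample_size: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p_hat = cases / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p_hat + z**2 / (2 * sample_size)) / denom
    half_width = (
        z
        * math.sqrt(
            p_hat * (1 - p_hat) / sample_size + z**2 / (4 * sample_size**2)
        )
        / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical study: 3 cases found among 2,000 people tested.
low, high = wilson_interval(cases=3, sample_size=2000)
print(f"point estimate: {3 / 2000:.5f}")
print(f"95% interval:   {low:.5f} to {high:.5f}")
```

The interval brackets the point estimate, so on sampling error alone the data are compatible with true rates both below and above the quoted figure.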

For any individual study you can validly say that you think the estimate is too low, or indeed, too high, and give reasons for that. For instance, you might say that your sample was mainly young people who tend to be healthier than the general public, or maybe that the diagnostic tools are known to miss some true cases.

But when we look at reporting as a whole, it almost always says the condition is likely to be much more common than the estimate.

For example, have a look at the results of this Google search:

“the true number may be higher” 20,300 hits

“the true number may be lower” 3 hits

You can try variations on the phrasing and see the same sort of pattern emerge. I’m curious as to why this bias occurs, or whether there’s another explanation for it.

> “the true number may be higher” 20,300 hits

Yes, but the true number may be higher.

It’s simply because if it’s a bigger number, it’s a bigger story.

This may just be my cynicism showing but I think it’s because generally the people responsible for estimating the prevalence of a given disease are the same people with an interest in making that prevalence seem high.

Say you’re an academic who researches GID (or anything else). If you can say “GID affects maybe 1 in 100 people” that looks a lot better when you’re asking for grant money than if you say “GID affects 1 in 100,000 people”.

Or say you’re an activist for the interests of people with GID: you will naturally want to make it seem as common as possible, because this is, unfortunately, equated with “normality” in most people’s minds.

Yet Google also says:

Results 1 – 10 of about 284 for “the true number may be higher” -“the true number may be higher or lower”. (0.19 seconds)

Ah, search engines. And English.

To me it seems like the same bias that favours hypothesis-proved over hypothesis-disproved results. The latter are more often filed away and neglected, while the former have a better chance of being submitted and accepted for publication.

It’s actually good statistical reasoning to assume that rare events are underestimated in their frequency. A Bayesian statistical analysis can demonstrate this. In short, if you observe N events from a Poisson-distributed (rare) random variable, then the maximum likelihood estimate of the mean of the underlying process (lambda) is N.

However, one could take a proper Bayesian approach and ask: what is the probability that the true mean, lambda, is greater than this estimate? It turns out that the smaller N is, the more likely it is that the true rate is greater.

To see this, place the conjugate Gamma prior on lambda so that the posterior distribution on lambda is also Gamma distributed to obtain,

p(lambda|n)
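Spelled out, this is the standard conjugate update. The sketch below assumes a Gamma prior with shape $\alpha$ and rate $\beta$ (symbols introduced here, not used in the comment itself) and a single observed count $n$:

$$
p(\lambda \mid n) \;=\; \frac{(\beta + 1)^{\alpha + n}}{\Gamma(\alpha + n)}\, \lambda^{\alpha + n - 1}\, e^{-(\beta + 1)\lambda}
$$

That is, the posterior is a Gamma distribution with shape $\alpha + n$ and rate $\beta + 1$.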

Now we can ask, what is the probability that lambda is actually greater than the mean of this distribution or even greater than the maximum a posteriori estimate of lambda? It turns out the answer is more or less independent of the parameters of the prior in this case. And it’s quite high. See figure here:

This result also holds for binomial distributions with uniform prior. x=lambda in the figure. For the Poisson case, the observed rate has been normalized by 1000 to fit on the same plot.
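The Poisson case can be checked numerically with a short sketch. It assumes a flat prior on lambda (an illustrative choice, not necessarily the prior behind the figure), under which the posterior after observing a count of N is a Gamma distribution with shape N + 1 and rate 1. For an integer shape, the probability that lambda exceeds a value x equals a Poisson cumulative sum, so only the standard library is needed. The function name is hypothetical.

```python
import math


def prob_rate_exceeds_count(n: int) -> float:
    """P(lambda > n | observed count n), assuming a flat prior on lambda.

    The posterior is then Gamma(shape=n + 1, rate=1). For integer shape k,
    P(lambda > x) = sum_{i=0}^{k-1} exp(-x) * x**i / i!, evaluated at x = n.
    """
    term = math.exp(-n)  # i = 0 term: exp(-n) * n**0 / 0!
    total = 0.0
    for i in range(n + 1):
        total += term
        term *= n / (i + 1)  # step from the i-th term to the (i+1)-th
    return total


# Probability that the true rate is above the observed count, for small N.
for n in (1, 2, 5, 10, 20):
    print(n, round(prob_rate_exceeds_count(n), 3))
```

Under these assumptions the probability stays above one half and shrinks towards it as N grows. The comparison here is against the observed count; measured against the posterior mean, which is N + 1 under this flat prior, the probability falls below one half.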

Here’s a hasty blog post explaining why this ‘bias’ is rational from a Bayesian statistical perspective:

http://justaddnoise.blogspot.com/2009/10/here-is-probability-that-true.html

but the short answer is just that the median estimate is greater than the mean and maximum a posteriori estimates of small quantities, and so it’s actually likely that small quantities are underestimated.