Estimating how many there are of something when you can’t see them all perfectly
A brief introduction to quantification
I wrote a tweet recently complaining about how it’s hard to estimate how many of what kind of posts there are.
You guys really have no idea how hard it is to estimate how many of what kind of posts there are
— Colin Fraser (@colin_fraser) October 16, 2024
This was a rare window on X into my professional life, which involves estimating how many posts are against the rules on the social media app that I work for. For many reasons, this turns out to be significantly harder than you might initially expect. This blog post is about just one of those reasons, a particular statistical quirk that arises in estimating a prevalence under measurement error. The tl;dr is that if there is any measurement error whatsoever, then a naive estimation procedure is almost guaranteed to produce a biased estimate of prevalence. This might be a bit surprising, because that’s emphatically not the case most of the time: usually, when you have measurements which may contain errors, you can expect that the positive errors and the negative errors will cancel out, leading to unbiased estimation. The result of measurement error is noise, but not bias. This fact is a bedrock of most of applied statistics. But in the case where observations are binary—posts are either against the rules or they’re not—the situation is different, and it’s sadly the case that any measurement error at all introduces a fairly complicated form of bias.
The basic problem of estimating prevalence
Say you want to estimate how much there is of some kind of thing. There’s a big set of things, some of which satisfy some property, and others of which don’t, and you want to know what fraction of them have the property. This fraction is the prevalence of the property, which I’ll denote as \(\mu\).
The most straightforward way to do this in theory would be to look at each thing and tally up how many of them have the property, but this is often infeasible. I can’t personally inspect every post and decide whether it violates the rules. Instead, you have some process that assigns a label to each thing. The label is meant to indicate whether the thing has the property, but sometimes it’s wrong.
I’m going to introduce some simple notation. Let \(Y\) be the true value of a randomly selected object, with \(Y=1\) indicating that the object has the property and \(Y=0\) otherwise, and let \(L\) be its label. If \(Y=1\), I’ll call the object an “actual positive”, and if \(L = 1\), I’ll call it an “apparent positive”. The goal is to estimate the true prevalence \(\mu = P(Y=1)\). By the way, a nice thing about binary variables like this is that we can also write \(P(Y=1)=E[Y]\) which is how I will primarily describe \(\mu\) here, but it’s useful to remember that these are the same.
A natural inclination is to just treat \(L\) as a proxy for \(Y\) and estimate the apparent prevalence \(E[L]=P(L=1)\), which I’ll denote in this post as \(\ell\) (for label).
This post is about why that doesn’t work.
A few examples of this problem in practice
I’m describing this all very abstractly, and the reason is that this situation actually comes up all the time, in all kinds of different ways. Here are a few examples.
You could be trying to estimate the fraction of people in a population who carry some disease. You can’t observe every person in the population, and even if you could, you can’t know for sure whether any person actually has the disease. All you can know is the outcome of a test which is administered to them. The test is imperfect, and occasionally produces false positives and false negatives. In this case, \(Y\) describes whether a randomly selected person actually has the disease, and \(L\) indicates whether they test positive.
Or maybe \(\mu\) is the fraction of examples from some LLM benchmark which contain a hallucination. Such a benchmark might have thousands of prompts, and so it might be infeasible to manually review them all and assess whether they contain hallucinations. To deal with this, many benchmarks of this type use another LLM—ideally one that is more powerful in some sense—to evaluate whether responses contain hallucinations. But this evaluator LLM itself can be error-prone: it can falsely indicate that hallucination-free text contains hallucinations (a false positive), and vice versa (a false negative).
Perhaps instead, \(\mu\) is the fraction of posts on a microblogging app that discuss The Academy Awards. You can’t look at every single post, but you can, for example, count up how many posts contain the string “Oscar”. Of course, this will falsely count posts discussing Oscar The Grouch (false positives), and it will miss posts that discuss the awards without referring to them by name at all (false negatives).
In my specific case, \(\mu\) is the fraction of all posts on a particular social media app which violate the rules. To determine this, we show a sample of posts to people for review, who label each as either violating or not. But sometimes the reviewers make a mistake.
All of these situations are abstractly the same. They all involve trying to estimate some true prevalence \(\mu\) by observing examples of possibly imperfect labels \(L\). Again, the natural inclination is to just use an imperfect label \(L\) like it’s a true value \(Y\). Even if you’re aware that your labels are imperfect, maybe somehow or other the false positives and false negatives will cancel out in the end, leading to something which might be noisy but is at least right on average. This does happen with other forms of measurement error. Unfortunately, it doesn’t happen here.
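To see this concretely, here is a minimal simulation sketch in Python. The numbers are made up purely for illustration: the true prevalence is 5%, the labeler catches 90% of actual positives, and it mistakenly flags 2% of actual negatives.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 0.05        # true prevalence P(Y = 1), chosen for illustration
n = 1_000_000    # number of simulated items

# True values, then imperfect labels: actual positives are labeled positive
# 90% of the time, actual negatives 2% of the time.
Y = rng.random(n) < mu
L = np.where(Y, rng.random(n) < 0.90, rng.random(n) < 0.02)

print(Y.mean())  # ~0.050, the true prevalence
print(L.mean())  # ~0.064, the apparent prevalence: the errors don't cancel
```

Even with a million observations the noise is negligible, but the gap between 5% and roughly 6.4% never goes away. That gap is the bias we’re about to quantify.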
Quantifying Label Quality
To better understand what does happen, we need two important measures of label quality: the true positive rate (TPR) and the false positive rate (FPR). The TPR, which I’ll also denote by \(\alpha\), is the probability that an actual positive is correctly labeled. Using the notation from above, it can be written as \(\alpha = P(L=1|Y=1)\). This quantity goes by many names: when the labels come from a machine learning model, it’s often called the recall, and when they come from a medical test, it’s called the sensitivity. The FPR, denoted by \(\beta\), is the probability that an actual negative is falsely identified as an apparent positive: \(\beta = P(L=1|Y=0)\).
These two measures characterize the imperfectness of the labeling process. A perfect labeler would have \(\alpha = 1\) and \(\beta = 0\). An imperfect labeler will have \(\alpha < 1\) and/or \(\beta > 0\). It’s a fact that if the labels are better than coin flips, we must have \(\alpha > \beta\). For the most part I’ll assume that this holds, but it will be interesting to think through what happens if it doesn’t.
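Continuing the toy simulation from earlier, if you have a sample where both \(Y\) and \(L\) are known, you can estimate these two quantities directly by conditioning on the true value (these are sample estimates of \(\alpha\) and \(\beta\), not the true rates):

```python
# Continuing the simulation above: condition on the true value Y.
alpha_hat = L[Y].mean()     # estimate of P(L = 1 | Y = 1), the TPR
beta_hat = L[~Y].mean()     # estimate of P(L = 1 | Y = 0), the FPR
print(alpha_hat, beta_hat)  # close to 0.90 and 0.02
```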
Quantifying the bias
With these defined, it’s pretty straightforward to see that the apparent prevalence \(\ell\) can be written as follows.
\[\begin{align*} \ell &= E[L] \\ &= E[L | Y = 1] P (Y = 1) + E[L|Y = 0] P(Y=0) \\ &= \alpha \mu + \beta (1-\mu) \\ & = \beta + (\alpha - \beta) \mu \end{align*}\]
This simple equation holds many important truths. Naturally, if we have a perfect labeler with \(\alpha = 1\) and \(\beta = 0\), it says that the apparent prevalence \(\ell\) equals the true prevalence \(\mu\). But otherwise, it says that these can’t be equal in general. The apparent prevalence ends up differing from the actual prevalence, and the amount by which it differs depends on all of \(\mu\), \(\alpha\), and \(\beta\). As a function of \(\mu\), we have a straight line with intercept \(\beta\), and slope \(\alpha - \beta\).
There is a single point, which I’ve labeled \(\mu'\), at which the actual prevalence is equal to the apparent prevalence, but for every other possible value of prevalence, the apparent prevalence differs. The magnitude and even the direction of this bias can take on a range of different values depending on the true prevalence. In a way, this is pretty disappointing news. It means that unless you have perfect labels, you’re virtually guaranteed to estimate prevalence incorrectly, and you can’t even say for sure in general how big the inaccuracy is.
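For the record, you can locate this fixed point by setting \(\ell = \mu\) in the equation above and solving:

\[\mu' = \beta + (\alpha - \beta)\mu' \quad \Longrightarrow \quad \mu' = \frac{\beta}{1 - \alpha + \beta}\]

With the illustrative numbers from the simulation earlier (\(\alpha = 0.9\), \(\beta = 0.02\)), this works out to \(\mu' = 0.02 / 0.12 \approx 0.167\).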
There are a few other bits of insight that we can obtain by studying this graph. For one thing, the relationship between the true and apparent prevalence is always flatter than the 45 degree line. This means that for small values of the actual prevalence (anywhere to the left of \(\mu'\)), we will tend to overestimate, and vice versa. If you have some sense of the approximate magnitude of the true prevalence, you can use this as a kind of rule of thumb to guess the direction of the bias, even if you don’t know the true and false positive rates for sure. If you’re trying to measure a very small prevalence with an imperfect test, you’re probably overestimating, and vice versa.
It also means that this estimation procedure will tend to understate the magnitude of changes in prevalence: when prevalence changes from \(\mu_0\) to \(\mu_0 + \Delta\), the estimate will change by \((\alpha - \beta)\Delta\), which is strictly less (in absolute value) than \(\Delta\). The apparent prevalence is squished in between \(\alpha\) and \(\beta\). In the extreme case where \(\alpha = \beta\), the line becomes flat, and you’ll end up estimating that prevalence is equal to \(\beta\) on average no matter its true value. Given that a labeler with \(\alpha=\beta\) is not better than a coin flip, it shouldn’t be surprising that labels generated in this way give no information about the true prevalence. Nonetheless, I’ve noticed in the real world that this tends to be a bit unintuitive. It’s tempting to expect that any labeling process will lead to estimates which are “directionally correct” even in the face of measurement error, but this shows that that’s not necessarily true! If the false positive rate is close enough to the true positive rate, the estimate becomes pure noise. In the perverse scenario where the FPR is higher than the TPR, the slope \(\alpha - \beta\) is negative, and increases in the true prevalence lead to decreases in the apparent prevalence, and vice versa. You would hope never to find yourself in this scenario, but it’s not impossible, and it’s good to be aware of this possibility.
All of this is particularly problematic if your project is to track the prevalence of something over time. If you’re tracking the progress of some disease which currently has a small prevalence, for example, it means that small upticks in the estimated prevalence probably indicate larger increases in the underlying true prevalence. But the more the disease spreads, the smaller that bias gets, until eventually it may become negative. This is very annoying.
It also has implications for experimentation. Suppose you are testing some intervention which is intended to change the prevalence. You’ll do this by comparing the prevalence in a test group to the prevalence in a control group. But with imperfect labels, you will underestimate the magnitude of the difference between the two groups.
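Here is a sketch of that arithmetic, using the same illustrative \(\alpha\) and \(\beta\) as before and a hypothetical intervention that cuts the true prevalence from 5% to 4%:

```python
alpha, beta = 0.90, 0.02   # illustrative TPR and FPR

def expected_apparent(mu):
    """Expected apparent prevalence given true prevalence mu."""
    return beta + (alpha - beta) * mu

mu_control, mu_test = 0.05, 0.04   # the intervention removes a full percentage point
measured_lift = expected_apparent(mu_control) - expected_apparent(mu_test)
print(round(measured_lift, 4))     # 0.0088: only (alpha - beta) = 88% of the true 0.01 change
```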
Formulation in terms of precision and recall
(This section is slightly wonky and you can safely skip over it.)
If the labels come from a machine learning model, it’s common to talk about the precision rather than the FPR. I think it’s better to use the FPR for reasons I’ll talk about momentarily, but there is a nice neat formula relating the precision and recall to prevalence, so I may as well include it. In terms of our notation, the precision, which I’ll denote by \(\pi\), can be written as \(\pi=P(Y=1|L=1)\), the probability that an apparent positive is actually positive. Applying Bayes’ theorem, we can write the following.
\[\begin{align*} \pi &= \frac{P(L=1|Y=1)P(Y=1)}{P(L=1)} \\ &= \frac{\alpha \mu} {\ell} \end{align*}\]
Rearranging gives an expression for \(\ell\) in terms of the precision and recall.
\[\ell = \frac{\alpha \mu}{\pi}\]
The reason I don’t like this framing is that the precision itself is sneakily a function of the prevalence, whereas one can somewhat reasonably assume that the TPR and FPR are not (though this is also an assumption). This is clear from the previous derivation: \(\pi = \frac{\alpha \mu} {\ell}\). It’s also easy to see if you think about the extreme cases: if the prevalence is \(0\) then the precision must be \(0\), since no apparent positive can be an actual positive. Similarly, if the prevalence is \(1\) then the precision must be \(1\). So the previous equation gives a misleadingly simplistic perspective on \(\ell\) as a function of prevalence.
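To illustrate, here is the precision implied by holding the same illustrative \(\alpha\) and \(\beta\) fixed while the true prevalence varies:

```python
alpha, beta = 0.90, 0.02   # illustrative TPR and FPR

def implied_precision(mu):
    """Precision implied by a fixed TPR and FPR at true prevalence mu."""
    ell = beta + (alpha - beta) * mu   # apparent prevalence
    return alpha * mu / ell            # Bayes' theorem

for mu in [0.01, 0.05, 0.20, 0.80]:
    print(mu, round(implied_precision(mu), 3))
# precision climbs from about 0.31 to about 0.99 as prevalence rises,
# even though the labeler itself hasn't changed at all
```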
Nonetheless it does provide some interesting alternative ways of looking at this situation. If \(\ell = \frac{\alpha \mu}{\pi}\), it follows that \(\ell = \mu\) if and only if \(\alpha = \pi\), that is, if the precision is equal to the recall. If labels are perfect then this will be true: precision and recall will both be \(1\). Otherwise, for a given recall (and assuming a fixed FPR), this will only be true at some specific value of prevalence—in particular, the value \(\mu'\) from before.
This also provides a way of thinking about how the quantifier bias relates to the precision-recall trade-off. For a labeler tuned for high precision at the expense of recall, the ratio \(\alpha/\pi\) is less than one, and so the apparent prevalence will understate the true prevalence. It makes sense. A high precision classifier only produces apparent positives when it’s very sure, overlooking less certain cases, leading to an underestimation of the true prevalence. The opposite is true for a labeler tuned for high recall at the expense of precision: \(\alpha/\pi\) is greater than one, and so \(\ell > \mu\). With a high recall labeler we are casting a wide net, including many actual negatives in our estimation of prevalence.
Some discussion and proposed solutions from the literature
When the source of the imperfect labels is a machine learning model, the problem of estimating the true prevalence has sometimes been called Quantification—a secret third brother of the classical supervised tasks of Regression and Classification. In this context, the naive approach using \(\ell\) as a proxy for the true prevalence has been called the “Classify And Count” method by Forman (2008), who proposes an alternative method which corrects the bias, appropriately called the “Adjusted Classify-And-Count” (ACC) method. Given \(\alpha\) and \(\beta\), the ACC estimator is obtained by simply solving the equation \(\ell = \beta + (\alpha - \beta) \mu\) for \(\mu\), leading to:
\[\hat \mu_{ACC}=\frac{\ell - \beta}{\alpha - \beta}\]
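Here is a minimal sketch of that correction in Python, assuming you have somehow obtained estimates of \(\alpha\) and \(\beta\) (say, from a separate sample with trusted gold labels):

```python
def adjusted_classify_and_count(ell, alpha, beta):
    """Invert ell = beta + (alpha - beta) * mu to recover the true prevalence.

    ell:   apparent prevalence (the mean of the imperfect labels)
    alpha: true positive rate (recall / sensitivity) of the labeler
    beta:  false positive rate of the labeler
    """
    if alpha <= beta:
        raise ValueError("labels must be better than coin flips (alpha > beta)")
    mu_hat = (ell - beta) / (alpha - beta)
    # Sampling noise can push the raw estimate outside [0, 1], so clip it.
    return min(max(mu_hat, 0.0), 1.0)

# With the illustrative numbers from earlier: an apparent prevalence of 0.064
# maps back to the true prevalence of 0.05.
print(adjusted_classify_and_count(0.064, alpha=0.90, beta=0.02))
```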
This is not the first time this problem has been noticed. Rogan and Gladen (1978) discuss how the prevalence of diseases can be incorrectly estimated when we rely on imperfect tests, and propose the following correction, thereafter known as the Rogan-Gladen estimator.
\[\hat \mu _{RG} = \frac{\ell + \text{Specificity} - 1}{\text{Sensitivity} + \text{Specificity} - 1}\]
“Specificity” and “sensitivity” are more frequently used to describe medical tests than true and false positive rates, but it turns out this is actually exactly equivalent to the ACC correction. Sensitivity is just a synonym for recall, and the definition of specificity is \(P(L=0 | Y = 0)\), which happens to be equal to \(1 - \beta\). Substituting these into \(\hat \mu_{RG}\), you get the exact formula for \(\hat \mu _{ACC}\). Funny how things are discovered and rediscovered. I’m sure someone must have noticed this connection before, but I haven’t actually seen it written anywhere.
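For completeness, here is the substitution written out, with sensitivity \(= \alpha\) and specificity \(= 1 - \beta\):

\[\hat \mu _{RG} = \frac{\ell + (1 - \beta) - 1}{\alpha + (1 - \beta) - 1} = \frac{\ell - \beta}{\alpha - \beta} = \hat \mu _{ACC}\]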
Conclusion
There’s a lot more to talk about here. There are many other methods which are more sophisticated than ACC/Rogan-Gladen, and we haven’t even touched on the notion of a confidence interval here. And the solution above really does just kick the can down the road: it’s no easier in general to obtain reliable estimates of the TPR and FPR than it is to estimate prevalence directly. There’s a tricky resource optimization problem lurking here: is it cheaper to get high quality TPR and FPR estimates in order to build a Quantifier, or would you be better off just estimating prevalence directly by more traditional means? This is also not the only thing that makes it hard to estimate how many of what kind of posts there are. All kinds of other issues, like severe class imbalance and non-response bias, arise and compete and interact with each other to make it a very complicated task.
These are all perhaps topics for a future post, but the goal of this post has just been to raise awareness of this one tricky issue that I’ve rarely seen discussed, especially given its apparent ubiquity across different fields, and its annoyingness in my own day-to-day life.
References
Forman G. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery. 2008 Oct;17:164-206.
Rogan WJ, Gladen B. Estimating prevalence from the results of a screening test. American Journal of Epidemiology. 1978 Jan 1;107(1):71-6.