### A Note About Measurement Error (Wonkish)

This post is for any statistics/econometrics followers who may be out there.

It is well known that random measurement error in a right-hand-side variable of a regression causes attenuation bias---the estimated coefficient is biased toward zero.

I have seen a few instances where people have misapplied the idea of attenuation bias to situations in which a proxy is used in place of an actual right-hand variable.  For example, the county average of a variable is substituted for an individual measure.

I've been thinking about these issues for a while as they pertain to our statistical work linking crop yields and weather, and trying to make inferences about climate change.  All of the weather measures have error, but we have a lot more confidence in our temperature measures than we do in our precipitation measures.  The question is whether having a lot of measurement error in precipitation is causing attenuation or other kinds of biases.

While we need to make some of the usual heroic assumptions typical in regression analysis, I don't think standard measurement error applies to the case where we use a proxy or an estimate in place of the truth.

Here's the basic idea:

In the classic case of measurement error we have a model that assumes the truth is:

Y = XB + e

where Y is a vector of outcome values for the thing we're predicting, X is a matrix of explanatory variables, with rows spanning observations and columns spanning the variables, B is a column vector of coefficients, and e is a random error.

Now suppose X is not observed; instead we observe

Xhat = X + u, where u is error

Thus, Xhat contains measurement error and has a larger variance than the true X.  The dependent variable is:

Y = XhatB - uB + e

If we regress Y on Xhat the error will include -uB, which is obviously correlated with Xhat, since Xhat includes u.  Since the explanatory variable is correlated with the error, our estimate of B will be biased if we use OLS.
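A quick simulation makes the attenuation concrete.  Under the classical setup above, with X, u, and e independent, the OLS slope converges to B * Var(X)/(Var(X) + Var(u)) rather than B.  Here's a minimal sketch (the specific sample size, seed, and unit variances are just illustrative choices, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
B = 2.0

# The true model: Y = XB + e
X = rng.normal(0.0, 1.0, n)
e = rng.normal(0.0, 1.0, n)
Y = X * B + e

# Classical measurement error: we observe X plus independent noise u
u = rng.normal(0.0, 1.0, n)          # Var(u) = 1, same as Var(X)
Xhat = X + u

# OLS slope of Y on Xhat (everything is mean zero, so no intercept needed)
b_classical = (Xhat @ Y) / (Xhat @ Xhat)

# Theory: plim = B * Var(X) / (Var(X) + Var(u)) = 2 * 1/2 = 1,
# i.e., biased toward zero relative to the true B = 2
print(b_classical)
```

With Var(u) equal to Var(X), the estimate is cut roughly in half, which is the attenuation bias in action.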

Now consider the second case, in which we have an imperfect estimate of X.

Here, we might say

X = Xhat + u, where u is random and may or may not be correlated with e.  We just want to be able to assume that u is not correlated with Xhat, which seems reasonable, because if they were correlated, Xhat could be improved by exploiting that correlation.

Thus, Xhat is a proxy for X but with less variance.  So we have

Y = XhatB + uB + e

Now, uB is part of the error, but it is part of X, not part of Xhat, the thing we observe.

Here the regression model error, uB + e, is not correlated with Xhat.  This is true even if u and e are correlated.  So we get an unbiased estimate of B.

So, while our predictions will be less accurate if we use some kind of smoothed proxy for the true variable, they will not necessarily be biased.  Or, if they are biased, they won't be biased in the sense of classical measurement error.

Update: David Lobell wrote to tell me that the second kind of error is called a Berkson error and goes back to a 1950 JASA article.  I knew this wasn't new---I've seen it lots of times.  But I often do see confusion between these two kinds of errors.