Most observational empirical studies in economics and other disciplines need to account for this general spatial connectedness of things. In principle, you can do this in two ways: (1) develop a model of the spatial relationship; or (2) account for the spatial connectedness by appropriately adjusting the standard errors of your regression model.

The first option is a truly heroic one, and almost all attempts I've seen seem foolhardy. Spatial geographic patterns are extremely complex and follow from deep geophysical and social histories (read Guns, Germs, and Steel). One is unlikely to uncover the full mechanism that underlies the spatial pattern. When one "models" this spatial pattern, assumptions drive the result, and the assumptions are, almost always, a heroic leap of faith.

That leaves (2), which shouldn't be all that difficult using modern statistical techniques, but does take some care and perhaps a little experimentation. It seems to me many are a little too blithe about it, and perhaps select methods that falsely exaggerate statistical significance.

Essentially, the problem is that there's normally a lot less information in a large data set than you think, because most observations from a region and/or time are correlated with other observations from that region and/or time. In statistical speak, the errors are clustered.

To illustrate how much this matters, I'll share some preliminary regressions from a current project of mine. Here I am predicting the natural log of corn yield using field-level data that span about 15 years on most of the corn fields in three major corn-producing U.S. states. I've got several hundred thousand observations. Yes, you read that right--it's a very rich data set.

But corn yields, as you can probably guess, tend to have a lot of spatial correlation. This happens in large part because weather, soils, and farming practices are spatially correlated. However, there isn't a lot of serial correlation in weather from year to year. So my data are highly correlated within years, and average outcomes have strong geographic correlation, but errors are mostly independent across years at a fixed location.

The amount of information in the data normally scales with the square root of the sample size. But when the data are clustered, spatially or otherwise, a conservative estimate for the amount of information is the square root of the number of clusters you have. In this data set, we don't really have fixed clusters; it's more like smooth, overlapping clusters. But since most spatial correlation in weather fades out after about 500 miles, we might proxy the number of clusters at around 45, the number of years times states I have. And since these states border each other, the effective number may be even less than 45. Now, I do have weather matched to each field depending on the field's individual planting date, which can vary a fair amount. That adds some statistical power. So I hope it's a bit better than the square root of 45. Either way, something in the ballpark of 45 is a whole lot less than several hundred thousand.
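To put rough numbers on this, here is a back-of-envelope sketch using the Moulton design-effect approximation, which assumes errors are equicorrelated within clusters. The sample size, cluster count, and correlation below are illustrative assumptions, not estimates from the actual corn data:

```python
# Effective sample size under intra-cluster correlation, via the
# Moulton design-effect approximation:
#   n_eff = n / (1 + (m - 1) * rho)
# where m is the average cluster size and rho the within-cluster
# error correlation. All numbers are illustrative assumptions.

def effective_n(n, avg_cluster_size, rho):
    """Effective number of independent observations."""
    design_effect = 1 + (avg_cluster_size - 1) * rho
    return n / design_effect

n = 500_000           # "several hundred thousand" field-level observations
clusters = 45         # ~ 15 years x 3 states
m = n / clusters      # average cluster size
rho = 0.5             # assumed strong within-cluster correlation

print(round(effective_n(n, m, rho)))  # about 90, i.e. ~ 2x the cluster count
```

With large clusters and substantial correlation, the effective sample size is driven almost entirely by the number of clusters, not the raw n.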

I regress the natural log of corn yield on:

- YEAR: a time trend
- log(Potential): the output of a crop model calibrated from daily weather inputs
- gdd: growing degree days (a temperature measure)
- DD29: degree days above 29C (a preferred measure of extreme heat)
- Prec & Prec^2: season precipitation and precipitation squared
- PDay: number of days since Jan 1 until planting
- an interaction between DD29 and CO2 exposure

CO2 exposure varies a little bit spatially, and also temporally, both due to a trend from burning fossil fuels and other emissions, and due to seasonal fluctuations following from tree and leaf growth (earlier planting tends to have higher CO2, and higher CO2 can improve yields).

The standard regression output gives:

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      2.320e+00  3.014e-02   76.98  < 2e-16
I(YEAR - 2000)   1.291e-02  4.600e-04   28.06  < 2e-16
log(Potential)   5.697e-01  5.470e-03  104.14  < 2e-16
gdd              1.931e-04  4.177e-06   46.24  < 2e-16
DD29            -2.477e-02  1.149e-03  -21.56  < 2e-16
Prec             1.787e-02  9.424e-04   18.96  < 2e-16
I(Prec^2)       -4.939e-04  2.038e-05  -24.24  < 2e-16
PDay            -6.798e-03  6.269e-05 -108.45  < 2e-16
DD29:AvgCO2      6.229e-05  2.953e-06   21.09  < 2e-16

Most people now use White "robust" standard errors, which use a variance-covariance matrix constructed from the residuals to account for arbitrary heteroscedasticity. Here's what that gives you:

                     Estimate   Std. Error   t value       Pr(>|t|)
(Intercept)     2.319894e+00 3.954834e-02  58.65970   0.000000e+00
I(YEAR - 2000)  1.290703e-02 5.362464e-04  24.06922  5.252870e-128
log(Potential)  5.696738e-01 7.161458e-03  79.54718   0.000000e+00
gdd             1.931294e-04 5.058033e-06  38.18271   0.000000e+00
DD29           -2.477002e-02 1.397239e-03 -17.72783   2.557376e-70
Prec            1.786707e-02 1.099087e-03  16.25627   2.016306e-59
I(Prec^2)      -4.938967e-04 2.327153e-05 -21.22321  5.830391e-100
PDay           -6.798270e-03 7.381894e-05 -92.09386   0.000000e+00
DD29:AvgCO2     6.229397e-05 3.616307e-06  17.22585   1.698989e-66

The standard errors are larger and the t-values smaller, but this standard approach still gives us extraordinary confidence in our estimates.
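To see what the White "sandwich" actually does, here is a minimal, self-contained sketch for a one-regressor case, where (by the Frisch-Waugh logic) the slope's robust variance reduces to a scalar formula. The data are synthetic and heteroscedastic by construction; this illustrates the idea, not the corn regression:

```python
# White's heteroscedasticity-robust (HC0) standard error for the slope
# in a simple regression, using only the standard library.
# Synthetic data: the error variance grows with |x|.
import random

random.seed(1)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0, 0.5 + abs(xi)) for xi in x]

xbar, ybar = sum(x) / n, sum(y) / n
xt = [xi - xbar for xi in x]                  # demeaned regressor
sxx = sum(xi * xi for xi in xt)
b = sum(xi * (yi - ybar) for xi, yi in zip(xt, y)) / sxx  # OLS slope
a = ybar - b * xbar
resid = [yi - a - b * xi for xi, yi in zip(x, y)]

# Classical (homoscedastic) SE: s^2 / sum(xt^2)
s2 = sum(e * e for e in resid) / (n - 2)
se_classical = (s2 / sxx) ** 0.5

# HC0 "sandwich" SE: sum(xt_i^2 * e_i^2) / (sum xt^2)^2
meat = sum((xi * ei) ** 2 for xi, ei in zip(xt, resid))
se_robust = (meat / sxx ** 2) ** 0.5

print(se_classical, se_robust)  # here the robust SE comes out larger
```

Because high-|x| observations have high error variance, weighting squared residuals by the squared regressor inflates the variance estimate relative to the classical formula.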

You should remain skeptical. Here's what happens when I use robust standard errors clustered by year:

                 Estimate Std. Error t value   Pr(>|t|)
(Intercept)     2.32e+00   5.57e-01    4.17  3.094e-05 ***
YEAR            1.29e-02   8.57e-03    1.52    0.12920
log(Potential)  5.70e-01   9.11e-02    6.25  4.000e-10 ***
gdd             1.93e-04   7.89e-05    2.45    0.01443 *
DD29           -2.48e-02   1.35e-02   -1.83    0.06719 .
Prec            1.79e-02   1.06e-02    1.68    0.09243 .
I(Prec^2)      -4.94e-04   2.15e-04   -2.29    0.02178 *
PDay           -6.80e-03   8.17e-04   -8.32  < 2.2e-16 ***
DD29:AvgCO2     6.23e-05   3.50e-05    1.78    0.07510 .

Standard errors are an order of magnitude larger and t-values are more humbling. Planting date and potential yield come in very strong, but now everything else is just borderline significant. It seems robust standard errors really aren't so robust.
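The mechanics of clustering are a small extension of the sandwich idea: sum the scores within each cluster before squaring, so that within-cluster correlation shows up in the variance. Here is a minimal stdlib-only sketch on synthetic panel data in which both the regressor and the error share a common within-year component (mimicking weather); all parameters are illustrative assumptions:

```python
# Cluster-robust (CR0) standard error for a simple regression slope.
# The "meat" squares the *cluster sums* of x_i * e_i, which picks up
# within-cluster correlation that plain HC0 ignores. Synthetic data.
import random

random.seed(2)
years, per_year = 40, 200
x, y, cluster = [], [], []
for t in range(years):
    x_shock = random.gauss(0, 1)   # common within-year regressor shock
    e_shock = random.gauss(0, 1)   # common within-year error shock
    for _ in range(per_year):
        xi = x_shock + random.gauss(0, 1)
        x.append(xi)
        y.append(2.0 + 0.5 * xi + e_shock + random.gauss(0, 1))
        cluster.append(t)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
xt = [xi - xbar for xi in x]                 # demeaned regressor
sxx = sum(xi * xi for xi in xt)
b = sum(xi * (yi - ybar) for xi, yi in zip(xt, y)) / sxx
a = ybar - b * xbar
resid = [yi - a - b * xi for xi, yi in zip(x, y)]

# HC0 (heteroscedasticity-only) SE for comparison
meat_hc = sum((xi * ei) ** 2 for xi, ei in zip(xt, resid))
se_hc0 = (meat_hc / sxx ** 2) ** 0.5

# Cluster-robust SE: sum scores within each year first, then square
sums = {}
for xi, ei, g in zip(xt, resid, cluster):
    sums[g] = sums.get(g, 0.0) + xi * ei
meat_cl = sum(s * s for s in sums.values())
se_cluster = (meat_cl / sxx ** 2) ** 0.5

print(se_hc0, se_cluster)  # the clustered SE is much larger
```

Note that clustering only bites when the regressor (not just the error) is correlated within clusters, which is exactly the situation with weather variables that are nearly constant across fields in a given year.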

But even if we cluster by year, we are probably missing some important dependence, since geographic regions may have similar errors across years, and clustering by year assumes all errors in one year are independent of all errors in other years.

If I cluster by state, the standard robust/clustering procedure will account for both geographic and time-series dependence within a state. Since I know from earlier work that one state is about the extent of spatial correlation, this seems reasonable. Here's what I get:

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     2.32e+00  1.1888e+00   1.9514  0.0510065 .
YEAR            1.29e-02  4.6411e-03   2.7810  0.0054194 **
log(Potential)  5.70e-01  1.6938e-01   3.3632  0.0007706 ***
gdd             1.93e-04  2.2126e-04   0.8729  0.3827338
DD29           -2.48e-02  2.6696e-02  -0.9279  0.3534781
Prec            1.79e-02  1.2786e-02   1.3974  0.1622882
I(Prec^2)      -4.94e-04  2.7371e-04  -1.8045  0.0711586 .
PDay           -6.80e-03  4.9912e-04 -13.6205  < 2.2e-16 ***
DD29:AvgCO2     6.23e-05  6.8565e-05   0.9085  0.3635962

Oops. Now most of the weather variables have lost their statistical significance too. But since clustering by state limits the assumed cross-sectional dependence within years to a single state, the time trend (YEAR) is now significant, when it wasn't under clustering by year. We probably shouldn't take that significance very seriously, since some kinds of dependence (like technology) probably span well beyond one state.

Note that this strategy of using large clusters combined with robust SE treatment (canned in Stata, for example) is what's recommended in Angrist and Pischke's *Mostly Harmless Econometrics*.

There are other ways of dealing with these kinds of problems. For example, you can use a "block bootstrap" that resamples whole years at a time, which preserves spatial correlation. This works well in agricultural applications, since weather is pretty much IID across years at a fixed location, so we should feel reasonably comfortable that there is little serial correlation. One can also adapt Conley's method to panel data; Solomon Hsiang has graciously provided code here. In earlier agriculture-related work, Wolfram Schlenker and I generally found that clustering by state gives standard errors similar to these methods.
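To make the block-bootstrap idea concrete, here is a minimal stdlib-only sketch on synthetic data: years are resampled with replacement as whole blocks, so whatever correlation exists within a year is carried along intact. The data-generating process and all parameters are illustrative assumptions:

```python
# Block bootstrap that resamples whole years at a time, preserving
# any within-year (spatial) correlation. Synthetic panel, stdlib only.
import random
import statistics

random.seed(3)

def slope(pairs):
    """OLS slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    xb = sum(p[0] for p in pairs) / n
    yb = sum(p[1] for p in pairs) / n
    sxx = sum((p[0] - xb) ** 2 for p in pairs)
    return sum((p[0] - xb) * (p[1] - yb) for p in pairs) / sxx

# Build a panel keyed by year; each year shares common shocks
years = list(range(30))
panel = {}
for t in years:
    x_shock, e_shock = random.gauss(0, 1), random.gauss(0, 1)
    obs = []
    for _ in range(100):
        xi = x_shock + random.gauss(0, 1)
        obs.append((xi, 2.0 + 0.5 * xi + e_shock + random.gauss(0, 1)))
    panel[t] = obs

pooled = [p for t in years for p in panel[t]]
b_hat = slope(pooled)

# Resample years (not individual observations) with replacement
boot_slopes = []
for _ in range(200):
    draw = [random.choice(years) for _ in years]
    sample = [p for t in draw for p in panel[t]]
    boot_slopes.append(slope(sample))

se_block = statistics.stdev(boot_slopes)
print(b_hat, se_block)
```

Resampling individual observations instead of year-blocks would break the within-year dependence and understate the standard error, which is exactly the trap the post warns about.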

The overarching lesson is: try it different ways and err on the side of least significance, because it's very easy to underestimate your standard errors and very hard to overestimate them.

And watch out for data errors: these have a way of screwing up both estimates and standard errors, sometimes quite dramatically.

If you had the patience to follow all of this, you might appreciate the footnotes and appendix in our recent comment on Deschenes and Greenstone.

We thought it was good too, down under.

Thank you for the post. W.r.t. watching out for data errors, it is always fun to see what random insurance agents etc. have typed into their forms. Are you dealing with only the optional unit data or basic units too?

Also, when clustering your data spatially, I think you will have a lot more luck if you use ecoregions instead - http://en.wikipedia.org/wiki/List_of_ecoregions_in_the_United_States_(EPA)#cite_note-CEC-1 - using the four major clusters in the I states and throwing out the data in the others.

Chris:

Clustering by ecoregions is *not* big enough. Fields in bordering ecoregions will have correlated yields, since they see very similar weather. You will still be underestimating your standard errors, probably by a lot. Indeed, as I showed above, clustering by state probably isn't big enough, and states tend to be larger than ecoregions.

How do you know what's big enough? Won't all geographic clusters have borders with correlated yields?

Mike: Yes, there will always be a little correlation at the edges of clusters, and this is a problem with the canned cluster approach. But it can be a relatively small problem if the clusters are large enough that, on average, errors from one cluster have little correlation with neighboring clusters. By looking explicitly at the errors, for instance with a spatial correlogram, you can get a sense of how large the clusters should be. Conley's method would be a better approach (adapted for panel regression as appropriate), if you want to deal with the issue more carefully.

Again, my suggested rule of thumb: try it different ways and err on the side with the largest SEs and lowest statistical significance. That can be hard to stomach when you really feel you need that big t-statistic. But don't fool yourself...

Michael - As a geography student advised by an economist, I am certainly happy to see these discussions come up. Recently I've been applying variations of geostatistical methods (think kriging) to replicate the response of global climate models in a computationally "cheap" way. These methods can easily be extended to crop yields. Here we have the opportunity to quantify the spatial autocorrelation and use that information to draw a sample distribution from the spatial data process. I've been wanting to apply this to a crop data set, but due to confidentiality issues with the yield data I'm using on my current project, I am unable to. I guess my point is, there's a world outside of our panels and standard errors from which we can borrow strength. Good post.

-Jordan