Saturday, May 18, 2013

Do journal impact factors distort science?

From my inbox:
An ad hoc coalition of unlikely insurgents -- scientists, journal editors and publishers, scholarly societies, and research funders across many scientific disciplines -- today posted an international declaration calling on the world scientific community to eliminate the role of the journal impact factor (JIF) in evaluating research for funding, hiring, promotion, or institutional effectiveness.
Here's the rest of the story at Science Daily:

And a link to DORA, the "ad hoc coalition" in question.

It seems fairly obvious that impact factors do distort science.  But I wonder how much, and I also wonder if there are realistic alternatives that would do a better job of encouraging good science.

There are delicate tradeoffs here: some literatures seem to become mired in their own dark corners, forming small circles of scholars who speak a common language.  They review each other's work, sometimes because no one else can understand it, or sometimes because no one else cares to understand it.  The circle has high regard for itself, but the work is pointless to those residing outside of it.

At the same time, people obviously have very different ideas about what constitutes good science.

So, what does the right model for evaluating science look like?

Wednesday, May 15, 2013

Consensus Statements on Sea Level Rise

In my mailbox from the AGU:
After four days of scientific presentations about the state of knowledge on sea-level rise, the participants reached agreement on a number of important key statements. These statements are the reflection of the participants of the conference and not official positions from the sponsoring societies.
 
Earth scientists agree that the global sea level is rising at an accelerated rate overall in response to climate change.
Scientists have a professional responsibility to inform government, the public, and the private sector about the impacts of rising sea levels and extreme events, and the risks they pose.
 
The geological record indicates that the current rates of sea-level rise in many regions are unprecedented relative to rates of the last several thousand years.
Global sea-level rise has changed rapidly in the past and scientific projections show it will continue to rise over the course of this century, altering our coasts.
 
Extreme events and their associated impacts will be more damaging and pose higher risks in the immediate future than sea-level rise.
Increasing human activity, such as land use change and water management practices, adds stress to already fragile ecosystems and can affect coasts just as much as sea-level rise.
 
Sea-level rise will exacerbate the impacts of extreme events, such as hurricanes and storms, over the long-term.
Extreme events have contributed to loss of life, billions of dollars in damage to infrastructure, massive taxpayer funding for recovery, and degradation of our ecosystems.
 
In order to secure a sustainable future, society must learn to anticipate, live with, and adapt to the dynamics of a rapidly evolving coastal system.
Over time feasible choices may change as rising sea level limits certain options. Weighing the best decisions will require the sharing of scientific information, the coordination of policies and actions, and adaptive management approaches.
 
Well-informed policy decisions are imperative and should be based upon the best available science, recognizing the need for involvement of key stakeholders and relevant experts.
As we work to adapt to accelerating sea level rise, deep reductions in emissions remain one of the best ways to limit the magnitude and pace of rising seas and cut the costs of adaptation.
 

Spatial Econometric Peeves (wonkish)

Nearly all observational data show strong spatial patterns.  Location matters, partly due to geophysical attributes, partly because of history, and partly because all the things that follow from these two key factors tend to feed back on and exaggerate spatial patterns.  If you're a data monkey, you probably like to look at cool maps that illustrate spatial patterns and spend a lot of time trying to make sense of them.  I know I do.

Most observational empirical studies in economics and other disciplines need to account for this general spatial connectedness of things.  In principle, you can do this two ways: (1) develop a model of the spatial relationship; (2) account for the spatial connectedness by appropriately adjusting the standard errors of your regression model.

The first option is a truly heroic one, and almost all attempts I've seen seem foolhardy.  Spatial geographic patterns are extremely complex and follow from deep geophysical and social histories (read Guns, Germs, and Steel).  One is unlikely to uncover the full mechanism that underlies the spatial pattern.  When one "models" this spatial pattern, assumptions drive the result, and the assumptions are, almost always, a heroic leap of faith.

That leaves (2), which shouldn't be all that difficult using modern statistical techniques, but does take some care and perhaps a little experimentation.  It seems to me many are a little too blithe about it, and perhaps select methods that falsely exaggerate statistical significance.

Essentially, the problem is that there's normally a lot less information in a large data set than you think, because most observations from a region and/or time are correlated with other observations from that region and/or time.  In statistical speak, the errors are clustered.

To illustrate how much this matters, I'll share some preliminary regressions from a current project of mine.  Here I'm predicting the natural log of corn yield using field-level data that cover about 15 years and most of the corn fields in three major corn-producing U.S. states.  I've got several hundred thousand observations.  Yes, you read that right--it's a very rich data set.

But corn yields, as you can probably guess, tend to have a lot of spatial correlation.  This happens in large part because weather, soils, and farming practices are spatially correlated.  However, there isn't a lot of serial correlation in weather from year to year.  So my data are highly correlated within years, and average outcomes have strong geographic correlation, but errors are mostly independent across years at a fixed location.

Whereas the amount of information in the data normally scales with the square root of the sample size, when the data are clustered (spatially or otherwise) a conservative estimate of the information scales with the square root of the number of clusters.  In this data set we don't really have fixed clusters; it's more like smooth, overlapping clusters.  But because most spatial correlation in weather fades out after about 500 miles, we might put the effective "number" of clusters at around 45, the number of years times states I have (15 x 3).  And since these states border each other, it may be even less than 45.  Now, I do have weather matched to each field according to the field's individual planting date, which can vary a fair amount, and that adds some statistical power.  So I hope the effective information is a bit better than the square root of 45 (about 7).  Either way, something in the ballpark of 45 clusters is a whole lot less than several hundred thousand independent observations.
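To get a feel for how much this can matter, here's a toy simulation in R (entirely made-up numbers, not my corn data), using the sandwich and lmtest packages.  When both the regressor and the errors share a strong within-cluster component, conventional standard errors are far too small:

# Toy simulation: strong within-cluster correlation in both x and the errors.
library(sandwich)
library(lmtest)
set.seed(1)
n_clusters <- 45                  # e.g., years x states
n_per      <- 1000                # observations per cluster
id <- rep(1:n_clusters, each = n_per)
x  <- rnorm(n_clusters)[id] + 0.1 * rnorm(n_clusters * n_per)  # x varies mostly at the cluster level
u  <- rnorm(n_clusters)[id] + rnorm(n_clusters * n_per)        # errors share a cluster component
y  <- 1 + 0 * x + u               # the true coefficient on x is zero
toy <- lm(y ~ x)
coeftest(toy)                                    # conventional SEs ignore the clustering and are far too small
coeftest(toy, vcov = vcovCL(toy, cluster = id))  # cluster-robust SEs reflect the ~45 independent chunks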

I regress the natural log of corn yield on

YEAR:  a time trend
log(Potential):  the output of a crop model calibrated from daily weather inputs
gdd:  growing degree days (a temperature measure)
DD29:  degree days above 29C (a preferred measure of extreme heat)
Prec & Prec^2:  season precipitation and precipitation squared
PDay:  the number of days from Jan 1 until planting
DD29 x AvgCO2:  an interaction between DD29 and CO2 exposure

CO2 exposure varies a little bit spatially and also temporally, both due to a trend from burning fossil fuels and other emissions and due to seasonal fluctuations that follow from tree and leaf growth (earlier planting tends to have higher CO2 exposure, and higher CO2 can improve radiation and water use efficiency in corn, which can effectively make the plants more drought tolerant).
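In R, that specification looks something like the sketch below (I'm guessing at the exact call; the data frame dat and the yield variable Yield are placeholder names, and the right-hand-side names match the output that follows):

fit <- lm(log(Yield) ~ I(YEAR - 2000) + log(Potential) + gdd + DD29 +
            Prec + I(Prec^2) + PDay + DD29:AvgCO2,
          data = dat)
summary(fit)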

The standard regression output gives:

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.320e+00  3.014e-02   76.98   <2e-16
I(YEAR - 2000)  1.291e-02  4.600e-04   28.06   <2e-16
log(Potential)  5.697e-01  5.470e-03  104.14   <2e-16
gdd             1.931e-04  4.177e-06   46.24   <2e-16
DD29           -2.477e-02  1.149e-03  -21.56   <2e-16
Prec            1.787e-02  9.424e-04   18.96   <2e-16
I(Prec^2)      -4.939e-04  2.038e-05  -24.24   <2e-16
PDay           -6.798e-03  6.269e-05 -108.45   <2e-16
DD29:AvgCO2     6.229e-05  2.953e-06   21.09   <2e-16

Notice the huge t-statistics: all the parameters look precisely identified.  But you should be skeptical.

Most people now use White "robust" standard errors, which use a variance-covariance matrix constructed from the residuals to account for arbitrary heteroscedasticity.  Here's what that gives you:


                    Estimate   Std. Error   t value      Pr(>|t|)
(Intercept)     2.319894e+00 3.954834e-02  58.65970  0.000000e+00
I(YEAR - 2000)  1.290703e-02 5.362464e-04  24.06922 5.252870e-128
log(Potential)  5.696738e-01 7.161458e-03  79.54718  0.000000e+00
gdd             1.931294e-04 5.058033e-06  38.18271  0.000000e+00
DD29           -2.477002e-02 1.397239e-03 -17.72783  2.557376e-70
Prec            1.786707e-02 1.099087e-03  16.25627  2.016306e-59
I(Prec^2)      -4.938967e-04 2.327153e-05 -21.22321 5.830391e-100
PDay           -6.798270e-03 7.381894e-05 -92.09386  0.000000e+00
DD29:AvgCO2     6.229397e-05 3.616307e-06  17.22585  1.698989e-66

The standard errors are larger and the t-values smaller, but this standard approach still gives us extraordinary confidence in our estimates.
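For reference, one common way to get these White standard errors in R is via the sandwich and lmtest packages (a sketch, assuming the fitted model above is stored as fit):

library(sandwich)
library(lmtest)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # heteroscedasticity-robust (White) SEs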

You should remain skeptical.  Here's what happens when I use robust standard errors clustered by year:


                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.32e+00  5.57e-01  4.17 3.094e-05 ***
YEAR            1.29e-02  8.57e-03  1.52   0.12920    
log(Potential)  5.70e-01  9.11e-02  6.25 4.000e-10 ***
gdd             1.93e-04  7.89e-05  2.45   0.01443 *  
DD29           -2.48e-02  1.35e-02 -1.83   0.06719 .  
Prec            1.79e-02  1.06e-02  1.68   0.09243 .  
I(Prec^2)      -4.94e-04  2.15e-04 -2.29   0.02178 *  
PDay           -6.80e-03  8.17e-04 -8.32 < 2.2e-16 ***
DD29:AvgCO2     6.23e-05  3.50e-05  1.78   0.07510 .  


Standard errors are an order of magnitude larger and t-values are more humbling.  Planting date and potential yield come in very strong, but now everything else is just borderline significant.  It seems robust standard errors really aren't so robust.
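For what it's worth, clustering by year is only a small change to the same calculation (a sketch; dat$YEAR is a placeholder for the year identifier used as the clustering variable):

coeftest(fit, vcov = vcovCL(fit, cluster = dat$YEAR))   # cluster-robust SEs, clustered by year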

But even if we cluster by year, we are probably missing some important dependence, since geographic regions may have similar errors across years, and in clustering by year, I assume all errors in one year are independent of all errors in other years.

If I cluster by state, the standard robust/clustering procedure will account for both geographic and time-series dependence within a state.  Since I know from earlier work that one state is about the extent of spatial correlation, this seems reasonable.  Here's what I get:

                  Estimate  Std. Error  t value  Pr(>|t|)    

(Intercept)     2.32e+00  1.1888e+00   1.9514 0.0510065 .  
YEAR            1.29e-02  4.6411e-03   2.7810 0.0054194 ** 
log(Potential)  5.70e-01  1.6938e-01   3.3632 0.0007706 ***
gdd             1.93e-04  2.2126e-04   0.8729 0.3827338    
DD29           -2.48e-02  2.6696e-02  -0.9279 0.3534781    
Prec            1.79e-02  1.2786e-02   1.3974 0.1622882    
I(Prec^2)      -4.94e-04  2.7371e-04  -1.8045 0.0711586 .  
PDay           -6.80e-03  4.9912e-04 -13.6205 < 2.2e-16 ***
DD29:AvgCO2     6.23e-05  6.8565e-05   0.9085 0.3635962    


Oops.  Now most of the weather variables have lost their statistical significance too.  But because clustering by state assumes errors are independent across states, even within the same year, the time trend (YEAR) is now significant when it wasn't under clustering by year.  We probably shouldn't take that significance very seriously, since some kinds of dependence (like technology) probably span well beyond one state.
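Computationally, clustering by state is the same sketch with a different clustering variable (dat$state is a placeholder for the state identifier):

coeftest(fit, vcov = vcovCL(fit, cluster = dat$state))   # cluster-robust SEs, clustered by state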

Note that this strategy of using large clusters combined with a robust SE treatment (canned in Stata, for example) is what's recommended in Angrist and Pischke's Mostly Harmless Econometrics.

There are other ways of dealing with these kinds of problems.  For example, you can use a "block bootstrap" that resamples residuals in whole-year blocks, which preserves the spatial correlation.  This is great in agricultural applications, since weather is pretty much IID across years in a fixed location, so we should feel reasonably comfortable that there is little serial correlation.  One can also adapt Conley's method for panel data; Solomon Hsiang has graciously provided code here.  In earlier agriculture-related work, Wolfram Schlenker and I generally found that clustering by state gives standard errors similar to those from these methods.
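Here's a minimal sketch of the block-bootstrap idea, resampling whole years of observations with replacement and refitting each time (a pairs-type variant rather than the residual-resampling version described above; dat and YEAR are again placeholder names):

form  <- formula(fit)
years <- unique(dat$YEAR)
B     <- 200                                   # number of bootstrap replications
boot_coefs <- replicate(B, {
  draw     <- sample(years, length(years), replace = TRUE)
  boot_dat <- do.call(rbind, lapply(draw, function(y) dat[dat$YEAR == y, ]))
  coef(lm(form, data = boot_dat))
})
apply(boot_coefs, 1, sd)                       # block-bootstrap standard errors

Because entire years are kept intact, the within-year spatial dependence is preserved in each bootstrap sample.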

The overarching lesson is: try it different ways and err on the side of least significance, because it's very easy to underestimate your standard errors and very hard to overestimate them.

And watch out for data errors: these have a way of screwing up both estimates and standard errors, sometimes quite dramatically.

If you had the patience to follow all of this, you might appreciate the footnotes and appendix in our recent comment on Deschenes and Greenstone.

Sunday, May 12, 2013

Laboratory Grown Meat: The Next Green Revolution?

From what I've learned about agriculture over the last 10 years, I'm increasingly skeptical that we'll see another green revolution like the last one.  Crop yields for the major staples appear to be approaching agronomic limits in advanced nations.  While there's still room for improvement in developing nations, a lot of the low-hanging fruit seems to have been picked.  And then there are the challenges of climate change, which could be beneficial in some places but is likely damaging in most places, and possibly severely damaging.

So, where's a technological optimist to turn?

It seems to me that if we have another green revolution, it's going to look more like this. Right now a 5 oz hamburger, grown in a petri dish rather than scraped off a dead animal, costs a reported $325,000.  That's one expensive burger.  But it is easy to imagine how costs could come down in time.

Anyway, there's obviously a lot of uncertainty about this sort of thing, not the least of which is consumer acceptance.  But in the long run, this kind of technology might do a lot to feed a burgeoning planet in a way that's a lot less environmentally damaging, and depending on your point of view, more humane.


Renewable energy not as costly as some think

The other day Marshall and Sol took on Bjorn Lomborg for ignoring the benefits of curbing greenhouse gas emissions.  Indeed.  But Bjorn, am...