I'd like to emphasize Cochrane's tips for empirical work:
These tips verge on “how to do empirical work” rather than just “how to write empirical work,” but in the larger picture “doing” and “writing” are not that different.
What are the three most important things for empirical work? Identification, Identification, Identification. Describe your identification strategy clearly. (Understand what it is, first!) Much empirical work boils down to a claim that “A causes B,” usually documented by some sort of regression. Explain how the causal effect you think you see in the data is identified.
1. Describe what economic mechanism caused the dispersion in your right hand variables. No, God does not hand us true natural experiments very often.
2. Describe what economic mechanism constitutes the error term. What things other than your right hand variable cause variation in the left hand variable?
3. Hence, explain why you think the error term is uncorrelated with the right hand variables in economic terms. There is no way to talk about this crucial assumption unless you have done items 1 and 2!
4. Explain the economics of why your instruments are correlated with the right hand variable and not with the error term.
5. Do you understand the difference between an instrument and a control? In regressing y on x, when should z be used as an additional variable on the right hand side and when should it be an instrument for x?
6. Describe the source of variation in the data that drives your estimates, for every single number you present. For example, the underlying facts will be quite different as you add fixed effects. With firm fixed effects, the regression coefficient is driven by the variation over time within each firm. Without firm fixed effects, the coefficient is (mostly) driven by variation across firms at a moment in time.
7. Are you sure you’re looking at a demand curve, not a supply curve? As one way to clarify this question, ask “whose behavior are you modeling?”
Example: Suppose you are interested in how interest rates affect housing demand, so you run the number of new loans on interest rates. But maybe when housing demand is large for other reasons, demand for mortgages (and other borrowing demand correlated with demand for mortgages) drives interest rates up. You implicitly assumed stable demand, so that an increase in price would lower quantity. But maybe the data are generated by a stable supply, so that increased demand raises the price, or some of both. Are you modeling the behavior of house purchasers or the behavior of savers (how savings responds to interest rates)?
8. Are you sure causality doesn’t run from y to x, or from z to y and x simultaneously? Think of the obvious reverse-causality stories. Example: You can also think about the last example as causality: Do interest rates cause changes in housing demand or vice versa (or does the overall state of the economy cause both to change)?
9. Consider carefully what controls should and should not be in the regression. Most papers have far too many right hand variables. You do not want to include all the “determinants” of y on the right hand side.
(a) High R2 is usually bad: it means you ran left shoes = α + β right shoes + γ price + error. Right shoes should not be a control!
(b) Don’t run a regression like wage = a + b education + c industry + error. Of course, adding industry helps raise the R2, and industry is an important other determinant of wage (it was in the error term if you did #2). But the whole point of getting an education is to help people move to better industries, not to move from assistant burger-flipper to chief burger-flipper.
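To make point 9(b) concrete, here is a minimal simulation sketch in plain Python. Everything in it is an assumption made up for illustration, not something from Cochrane: a continuous "industry quality" index stands in for industry, and the coefficients (0.8, 1.0, 0.5) are invented. Education raises industry quality, industry quality raises wages, and so controlling for industry strips out most of the payoff to education.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate(n=20000, seed=0):
    """Return (education coefficient without industry, with industry)."""
    rng = random.Random(seed)
    education = [rng.gauss(0, 1) for _ in range(n)]
    # Assumption: education moves people into better industries.
    industry = [0.8 * e + rng.gauss(0, 1) for e in education]
    # Assumption: wage depends mostly on industry, plus a small direct
    # effect of education.
    wage = [1.0 * ind + 0.5 * e + rng.gauss(0, 1)
            for e, ind in zip(education, industry)]

    # wage = a + b education + error
    without_industry = slope(education, wage)

    # wage = a + b education + c industry + error, computed by partialling
    # industry out of both wage and education (Frisch-Waugh-Lovell).
    with_industry = slope(residuals(industry, education),
                          residuals(industry, wage))
    return without_industry, with_industry


if __name__ == "__main__":
    b_without, b_with = simulate()
    print(f"education coefficient, no industry control: {b_without:.2f}")
    print(f"education coefficient, industry controlled: {b_with:.2f}")
```

With these made-up numbers, the regression without industry should come out near 1.3 (0.5 directly plus 1.0 × 0.8 through industry), while the regression with industry should come out near 0.5. Neither number is wrong as arithmetic, but only the first answers "what does education do to wages?"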
Give the stylized facts in the data that drive your result, not just estimates and p values. For a good example, look at Fama and French’s 1996 “Multifactor explanations.” In the old style we would need one number: the GRS test. Fama and French show us the expected returns of each portfolio, they show us the beta of each portfolio, and they convince us that the pattern of expected returns matches the pattern of betas. This is the most successful factor model of the last 15 years... even though the GRS test is a disaster! They were successful because they showed us the stylized facts in the data.
Explain the economic significance of your results. Explain the economic magnitude of the central numbers, not just their statistical significance. Especially in large panel data sets even the tiniest of effects is “statistically significant.” (And when people show up with the usual 2.10 t statistic in large panel data sets, the effect is truly tiny!)
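A quick way to see the gap between statistical and economic significance is to simulate it. The sketch below is illustrative only: the sample size, the true slope of 0.01, and the function name `tiny_effect` are all assumptions of mine. It regresses y on x in a large sample where x explains almost none of the variation in y.

```python
import math
import random


def tiny_effect(n=400_000, beta=0.01, seed=1):
    """Simulate y = beta * x + noise and return (slope, t statistic, R2)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    # Assumption: unit-variance noise, so x explains about beta**2 of var(y).
    y = [beta * xi + rng.gauss(0, 1) for xi in x]

    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    b = sxy / sxx                        # OLS slope
    ssr = syy - b * sxy                  # residual sum of squares
    se = math.sqrt(ssr / (n - 2) / sxx)  # conventional OLS standard error
    return b, b / se, 1 - ssr / syy


if __name__ == "__main__":
    b, t, r2 = tiny_effect()
    print(f"slope = {b:.4f}, t = {t:.2f}, R2 = {r2:.5f}")
```

Under these assumptions the t statistic should land comfortably above 2 (its expected value is about 0.01 × √400,000 ≈ 6.3), while R2 stays around 0.0001: a one-standard-deviation move in x shifts y by about a hundredth of a standard deviation. "Significant," and tiny.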
Of course, every important number should include a standard error.
This kind of empirical logic was standard-fare training at Berkeley many years ago. But it is still rare among environmental economists and almost nonexistent among agricultural economists. Many of the standard-fare "classics" in ag. econ have undefined errors in their models and obviously endogenous right-hand-side variables, and lack any discussion of the essential comparisons underlying the identification strategy. I won't even get started on instrumental variables.... The prevailing style seems to be one in which assumptions are typically stated but rarely defended rhetorically. Test statistics are often given but the weight of the evidence is rarely shown.
A common misconception is that to worry about these issues somehow pits reduced-form or quasi-experimental empiricists against structural modelers. This is obviously false given that Cochrane--a pretty structural guy--is the one making these points.
These are just basic tenets of good empirical work with observational data. As a new Associate Editor of AJAE (one of 20 or more), I plan to do my own tiny part in trying to enforce these tenets.
Late Update: The link on Mankiw's site is old. Here's a link (PDF) to all of Cochrane's writing tips.