What You Can't Infer from a Regression
by Ajay Shah, ajayshahblog
We are inundated by results of research papers such as:
- Companies which have women directors do better; hedge funds which have female GPs do better [link].
- Parents who have children graduating from college tend to live longer [link].
These correlations are facts. Almost everyone who reads about such results jumps to the conclusion that they have consequences for the decisions we can make. For example:
- Since companies which have women directors do better, let's add women directors to our company and it will fare better.
- Since hedge funds with female GPs do better, let's add women GPs to our hedge fund and it will fare better.
- Since parents whose kids graduate from college tend to live longer, if we take extra trouble to put our kids through college, then we will live longer.
All these statements are flat wrong.
The fact that there is a correlation absolutely does not imply that there is a causal connection that can be used to make a decision. When x and y are correlated there are numerous possibilities. Maybe x has a causal impact on y. Maybe y has a causal impact on x. Maybe there is a z which has a causal impact on both x and y. We cannot jump to a conclusion about which of these causal pathways is at work when all we observe is a correlation.
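The third possibility, a z that drives both x and y, can be sketched with a small simulation. This is a minimal illustration under invented assumptions: the data-generating process, the coefficients and the function names (simulate_confounded, ols_coefs) are all made up for the example, and x has no effect on y by construction.

```python
import numpy as np

def simulate_confounded(n=10_000, seed=0):
    """Draw x and y that both depend on z; x has no effect on y."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)          # hidden common cause (assumed)
    x = z + rng.normal(size=n)      # x is driven by z
    y = z + rng.normal(size=n)      # y is driven by z, not by x
    return x, y, z

def ols_coefs(y, X):
    """OLS coefficients of y on an intercept plus the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coefs

x, y, z = simulate_confounded()
naive_slope = ols_coefs(y, x)[1]                             # y on x alone
controlled_slope = ols_coefs(y, np.column_stack([x, z]))[1]  # y on x and z

print(f"corr(x, y)             = {np.corrcoef(x, y)[0, 1]:.2f}")
print(f"slope of y on x        = {naive_slope:.2f}")
print(f"slope on x, z included = {controlled_slope:.2f}")
```

In this setup the slope of y on x alone should come out near 0.5 even though changing x would do nothing to y, and it should fall to roughly zero once z is included. The catch is that we can only include z here because we built it; with real observational data the common cause is typically unobserved.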
Let's take women on boards as an example. Suppose the evidence shows that firms with women on boards do better. It could just be the case that these are firms with a more socially progressive outlook, and maybe more progressive teams fare better than socially backward teams. If so, a band-aid of two women added to the board of a neanderthal bunch is not going to change their outlook.
Let's take parents and children and college. Some kinds of parents have the household environment where children delay gratification, work hard, and immerse themselves less in mass culture. Those kinds of kids make it to college and graduate from college. Those kinds of parents live longer. There need not be any causal connection between the two.
The Problem of Reverse Regression
When we see a linear regression
y = a + b x + e
we jump to the conclusion that changing x by 1 unit will have a causal impact on y of b. There is absolutely no justification for this with observational data! If you want to terrify your young students in a dark alley, do the reverse regression:
x = c + d y + u
The slope will be significant here also. So does changing y have an impact on x or does changing x have an impact on y?
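A quick numerical check of this point, as a sketch on simulated data (the numbers and the helper slope_and_t are invented for illustration): in a simple regression with one regressor, the t-statistic of the slope is the same whichever variable is placed on the left, and the product of the two slopes equals the squared correlation.

```python
import numpy as np

def slope_and_t(dep, reg):
    """Slope and t-statistic from a simple OLS regression of dep on reg."""
    n = len(dep)
    reg_c = reg - reg.mean()
    dep_c = dep - dep.mean()
    slope = (reg_c @ dep_c) / (reg_c @ reg_c)
    resid = dep_c - slope * reg_c
    sigma2 = (resid @ resid) / (n - 2)       # residual variance
    se = np.sqrt(sigma2 / (reg_c @ reg_c))   # standard error of the slope
    return slope, slope / se

# Assumed toy data; the regression output cannot reveal how it was generated.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + rng.normal(scale=3.0, size=500)

b_forward, t_forward = slope_and_t(y, x)   # y regressed on x
b_reverse, t_reverse = slope_and_t(x, y)   # x regressed on y
r = np.corrcoef(x, y)[0, 1]

print(f"forward slope {b_forward:.3f}, t = {t_forward:.2f}")
print(f"reverse slope {b_reverse:.3f}, t = {t_reverse:.2f}")
print(f"product of slopes {b_forward * b_reverse:.3f}, r^2 = {r**2:.3f}")
```

Both regressions are equally "significant", so significance by itself says nothing about the direction of causation.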
Let's Be More Careful
To take the results of conventional linear econometrics, and imply that there is a causal interpretation, or that someone can use the result to make a better decision, is immoral and unethical.
Many economists are a bit cavalier about these distinctions. The only way to learn about the impact of a treatment is to observe natural or artificial experiments where events happen for exogenous reasons, through which we get to see what happened in the aftermath of the change for near-identical units of observation where one is treated and one is held as a control. The two key tools for this are matching (to figure out what are the near-identical units of observation) and event studies (to figure out what happened after the event).
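Here is a rough sketch of the matching idea on simulated firms. Everything in it is an assumption made for illustration: a single matching covariate (size), a treatment that is more likely for large firms, a true effect of 1.0, and a bare before/after change standing in for a proper event study. The function names are hypothetical.

```python
import numpy as np

def simulate_firms(n=2000, effect=1.0, seed=0):
    """Toy firms: large firms are treated more often and also grow faster."""
    rng = np.random.default_rng(seed)
    size = rng.normal(size=n)
    treated = rng.random(n) < 1.0 / (1.0 + np.exp(-size))
    pre = size + rng.normal(scale=0.5, size=n)
    # Growth depends on size whether or not the firm is treated.
    post = pre + 0.5 * size + effect * treated + rng.normal(scale=0.5, size=n)
    return size, treated, pre, post

def naive_estimate(treated, pre, post):
    """Mean change of treated firms minus mean change of all untreated firms."""
    change = post - pre
    return change[treated].mean() - change[~treated].mean()

def matched_estimate(size, treated, pre, post):
    """Compare each treated firm with its nearest-sized untreated firm."""
    change = post - pre
    t_idx = np.flatnonzero(treated)
    c_idx = np.flatnonzero(~treated)
    # Absolute size gap between every treated firm and every control firm.
    gap = np.abs(size[t_idx][:, None] - size[c_idx][None, :])
    match = c_idx[gap.argmin(axis=1)]      # nearest control, with replacement
    return np.mean(change[t_idx] - change[match])

size, treated, pre, post = simulate_firms()
naive = naive_estimate(treated, pre, post)
matched = matched_estimate(size, treated, pre, post)
print(f"naive comparison of changes: {naive:.2f}")
print(f"matched comparison:          {matched:.2f}")
```

In this toy setup the naive comparison overstates the effect because treated firms are larger and large firms grow faster anyway, while the matched comparison should land close to the assumed effect of 1.0. Real work would match on more than one characteristic and inspect the quality of each match.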
The old-style regressions, where vast datasets are thrown into some matrix algebra, are dangerous and best avoided. No amount of torture by matrices can rescue a bad design. With observational data, the exogeneity assumption behind the Gauss-Markov theorem can fail, and OLS is then not BLUE. Everything we have been told in traditional econometrics is suspect when faced with observational data.
Corollary when Working with Indian Firm Data
The emphasis on "near-identical units of observation" has an interesting implication when working with firm data. Imagine that you're doing something involving y and x and firm data. There are some natural experiments where x changes for some firms. You are looking for near-identical firms where x did not change. This would make possible an event study to figure out the impact of the change.
Suppose the treatment was applied to Reliance Industries. You're out of luck because there is no company in India which is near-identical to Reliance Industries. You have to drop this observation and move on. Reliance Industries is sui generis.
Suppose the treatment was NOT applied to Reliance Industries. No treated company will ever be much like Reliance Industries. You will not find Reliance Industries in your matched dataset.
Suppose you were not careful in this data preparation and Reliance Industries somehow showed up in your dataset. It is quite likely to be an outlier and will mess up your results.
Hence I fear that any regression done with a dataset that contains accounting data for Reliance Industries is wrong.
(This is not a problem with returns data, as the returns data for Reliance is much like that seen for other firms.)
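The outlier concern can be illustrated with a stylized example. The numbers below are invented and are not Reliance data: 200 ordinary firms whose x and y are unrelated, plus one hypothetical giant that is extreme on both.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)

rng = np.random.default_rng(0)
x = rng.normal(size=200)    # accounting variable for ordinary firms
y = rng.normal(size=200)    # outcome, unrelated to x by construction

# One hypothetical giant firm, far from the rest on both variables.
x_with_giant = np.append(x, 50.0)
y_with_giant = np.append(y, 50.0)

slope_without = ols_slope(x, y)
slope_with = ols_slope(x_with_giant, y_with_giant)
print(f"slope without the giant firm: {slope_without:.2f}")
print(f"slope with the giant firm:    {slope_with:.2f}")
```

A single observation out of 201 moves the slope from roughly zero to nearly one. The regression is then describing the giant firm, not the other 200.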