Sunday, January 8, 2017

When is a Dummy Variable Not a Dummy Variable?

In econometrics we often use "dummy variables" to allow for changes in estimated coefficients when the data fall into one "regime" or another. An obvious example is when we use such variables to allow for the different "seasons" in quarterly time-series data.

I've posted about dummy variables several times in the past - e.g., here

However, there's one important point that seems to come up from time to time in emails that I receive from readers of this blog. I thought that a few comments here might be helpful.

The following variable can legitimately be called a "dummy variable":

Di = 1   ;    if a certain condition holds
   = 0   ;    otherwise.

The following variable is not a dummy variable:

Ni = 0    ;  if condition A holds
   = 1    ;  if condition B holds (where A and B are mutually exclusive conditions)
   = 2    ;  otherwise. (Call this condition C, say.)

Let's see what's different about Di and Ni, and then we can consider some further examples.

Let's add  Di as a regressor in a regression model. For simplicity I'll just add it (rather than interact it with another regressor) so that it just shifts the intercept. However, this doesn't affect any of the points that I make below.

So, our model is:

yi = α + β xi + γ Di + ui

where ui is the random error term.

If Di = 1, then the intercept is (α + γ); and if  Di = 0, then the intercept is just α. The estimated (positive or negative) "shift" in the intercept is just the estimate of γ that we obtain when we use (say) OLS. The data entirely determine the magnitude of this shift.
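Here's a minimal sketch of that intercept shift, using simulated data. Everything in it is an assumption made for illustration: the "true" values of α, β, and γ, the sample size, the error variance, and the variable names. It isn't taken from any real application.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
D = (rng.random(n) < 0.5).astype(float)    # 1 if the condition holds, 0 otherwise

# Illustrative "true" parameters (assumed, not from any data set)
alpha, beta, gamma = 1.0, 2.0, 3.0
y = alpha + beta * x + gamma * D + rng.normal(scale=0.1, size=n)

# OLS of y on a constant, x, and the dummy
X = np.column_stack([np.ones(n), x, D])
a_hat, b_hat, g_hat = np.linalg.lstsq(X, y, rcond=None)[0]

intercept_when_D0 = a_hat            # estimated intercept when Di = 0
intercept_when_D1 = a_hat + g_hat    # estimated intercept when Di = 1
print(intercept_when_D0, intercept_when_D1, g_hat)
```

The gap between the two estimated intercepts is exactly the estimate of γ, and nothing in the way Di is coded restricts how big or small that gap can be.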

On the other hand, suppose that we replace  Di by  Ni in our model:

yi = α + β xi + γ Ni + ui

Now, if condition A holds, then the intercept is α; if condition B holds, then the intercept is (α + γ); and otherwise the intercept is (α + 2γ). Regardless of what the data tell us by way of an estimate for γ, the shift in the estimated intercept from condition A to condition C is constrained to be twice the shift that we estimate from condition A to condition B.

We've essentially pre-judged part of the answer and imposed it before we even estimated the model! Generally, this is not something that we'd want to do.
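To see the restriction at work, here's a small simulated sketch. The true intercept shifts for conditions B and C (1 and 5), the other parameter values, and all of the names are hypothetical choices made just for this illustration. For comparison, the sketch also fits a version with a separate 0/1 dummy for each of B and C, which is one common unrestricted alternative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
N = rng.integers(0, 3, size=n)          # 0 = condition A, 1 = B, 2 = C

# Hypothetical true shifts relative to A: B shifts by 1, C shifts by 5
shifts = np.array([0.0, 1.0, 5.0])
y = 1.0 + 2.0 * x + shifts[N] + rng.normal(scale=0.1, size=n)

# Restricted model: Ni enters as a single regressor
X_r = np.column_stack([np.ones(n), x, N])
coef_r = np.linalg.lstsq(X_r, y, rcond=None)[0]
shift_AB_r = coef_r[2]                  # estimated shift, A to B
shift_AC_r = 2 * coef_r[2]              # forced to be twice the A-to-B shift
rss_r = np.sum((y - X_r @ coef_r) ** 2)

# Unrestricted model: one 0/1 dummy each for B and C
D_B = (N == 1).astype(float)
D_C = (N == 2).astype(float)
X_u = np.column_stack([np.ones(n), x, D_B, D_C])
coef_u = np.linalg.lstsq(X_u, y, rcond=None)[0]
shift_AB_u, shift_AC_u = coef_u[2], coef_u[3]
rss_u = np.sum((y - X_u @ coef_u) ** 2)

print(shift_AB_r, shift_AC_r, rss_r)
print(shift_AB_u, shift_AC_u, rss_u)
```

With Ni as the regressor, the A-to-C shift is twice the A-to-B shift by construction, whatever the data say. The two-dummy version lets each shift be estimated freely, and in this simulated setting it fits the data much more closely.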

You might now ask yourself, does it make sense to use any of the following "dummy variables" as regressors?
• Di = 1,  if condition A holds  ;   Di  = -1,  if condition A does not hold.
• Di = 0,  if condition A holds  ;   Di  = 1,  if condition B holds; Di  = -1, if condition C holds.
(In the second case, conditions A, B, and C are mutually exclusive and collectively exhaustive.)
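One way to explore the first of these questions is to fit the same simulated data with both codings and compare the results. The sketch below does that; the parameter values, sample size, and names are all assumptions made for illustration. Note that the +1/-1 variable is just 2Di - 1, where Di is the usual 0/1 dummy for condition A.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
D01 = (rng.random(n) < 0.5).astype(float)   # usual coding: 1 if A holds, 0 if not
Dpm = 2.0 * D01 - 1.0                        # same condition coded as +1 / -1

# Illustrative data-generating process (assumed parameter values)
y = 1.0 + 2.0 * x + 3.0 * D01 + rng.normal(scale=0.1, size=n)

def ols(X, y):
    """Return the OLS coefficient vector from regressing y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

X01 = np.column_stack([np.ones(n), x, D01])
Xpm = np.column_stack([np.ones(n), x, Dpm])
b01, bpm = ols(X01, y), ols(Xpm, y)

same_fit = np.allclose(X01 @ b01, Xpm @ bpm)   # are the fitted values identical?
gap_01 = b01[2]        # intercept gap between the two regimes, 0/1 coding
gap_pm = 2 * bpm[2]    # intercept gap, +1/-1 coding: (α + γ) - (α - γ) = 2γ
print(same_fit, gap_01, gap_pm)
```

For the second coding, writing out the three intercepts gives α under A, (α + γ) under B, and (α - γ) under C. It's worth asking what that implies about the A-to-B and A-to-C shifts before the model is even estimated.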