## Friday, July 6, 2018

### Interpreting Dummy Variable Coefficients After Non-Linear Transformations

Dummy variables - ones that take only the values zero and one - are commonly used as regressors in regression models. I've devoted several posts to discussing various aspects of such variables, notably here, but also here, here, and here.

When the regression model in question is linear, in both the variables and the parameters, the interpretation of the coefficient of such a dummy variable is simple. Suppose that the model takes the form:

$$y_i = \alpha + \beta D_i + \sum_j \gamma_j X_{ji} + \varepsilon_i \,; \quad E(\varepsilon_i) = 0 \,; \quad i = 1, \ldots, n. \tag{1}$$

The range of summation in the term on the right-hand side of (1) is from 1 to k, if there are k regressors in addition to the dummy variable, D. (There is no loss of generality in assuming a single dummy regressor in what follows, and no further distributional assumptions about the error term will be needed or used.)

As you'll know, if Di = 0, then the intercept coefficient in (1) is just α; and it shifts to (α + β) if Di = 1. It changes by an amount equal to β, and so does the predicted mean value of y. Conversely, the intercept and the predicted mean change by -β if Di changes from 1 to 0. Estimating (1) by OLS will give us an estimate of the effect on y of Di switching from 0 to 1 in value, or vice versa.

But a bit more on estimation issues below!

Another way of interpreting what is going on is to think about the growth rate in the expected value of y that is implied when D changes its value. Setting Di = 0, and then Di = 1, this growth rate is:

$$g_{01i} = \frac{(\alpha + \beta + \sum_j \gamma_j X_{ji}) - (\alpha + \sum_j \gamma_j X_{ji})}{\alpha + \sum_j \gamma_j X_{ji}} = \frac{\beta}{\alpha + \sum_j \gamma_j X_{ji}} \,,$$

which you can multiply by 100 to convert it into a percentage rate of growth, if you wish.

Note that this growth rate depends on the other parameters in the model, and also on the sample values for the other regressors.

Conversely, when D changes in value from 1 to 0, this growth rate is different, namely:

$$g_{10i} = - \frac{\beta}{\alpha + \beta + \sum_j \gamma_j X_{ji}} \,; \quad i = 1, \ldots, n.$$

In this fully linear model these growth rates offer a somewhat less appealing way of summarizing what is going on than does the amount of change in the expected value of y. The latter doesn't depend on the other parameters of the model, or on the sample values of the regressors.
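To make the data-dependence of these growth rates concrete, here is a minimal Python sketch. The parameter values and regressor values are hypothetical, chosen only for illustration, and the function name is my own.

```python
# Hypothetical parameter values, chosen only for illustration.
alpha, beta = 2.0, 0.5
gamma = [0.3, -0.1]   # coefficients on two non-dummy regressors


def linear_growth_rates(x):
    """Return (g01, g10) for one observation with regressor values x."""
    base = alpha + sum(g * xj for g, xj in zip(gamma, x))  # E(y) when D = 0
    g01 = beta / base             # D switches from 0 to 1
    g10 = -beta / (base + beta)   # D switches from 1 to 0
    return g01, g10


# Two observations with different regressor values give different rates.
for x in ([1.0, 2.0], [5.0, 1.0]):
    g01, g10 = linear_growth_rates(x)
    print(x, round(100 * g01, 2), round(100 * g10, 2))
```

The amount of change in the expected value of y is β for both observations, but the two percentage growth rates differ across them.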

However, this situation can change very quickly once we move to a regression model that is non-linear, either in the variables or in the parameters (or both).

That's what I want to focus on in this post.

Let's consider some interesting examples that involve common transformations of the dependent variable in a regression model. Apart from anything else, such transformations are often undertaken to make the assumption of a normally distributed error term more reasonable.

Transforming the Dependent Variable

The following discussion is quite extensive. I've deliberately worked through the various examples in some detail, so that you can see exactly where the various results come from. However, I'll be uploading a separate, brief, "Summary of Results" here, shortly.

1.  Logarithmic Dependent Variable

First, suppose that the model, while still linear in the parameters, is non-linear in y, through a (natural) logarithmic transformation -

$$\log(y_i) = \alpha + \beta D_i + \sum_j \gamma_j X_{ji} + \varepsilon_i \,; \quad E(\varepsilon_i) = 0 \,; \quad i = 1, \ldots, n \,; \quad y_i > 0. \tag{2}$$

While β still represents the shift in the intercept of the model, its value is no longer the shift in the expected value of y. However, the growth rate in the expected value of y now has a very simple expression, as was discussed by Halvorsen and Palmquist (1980), for example.

From (2), we can write:

$$y_i = \exp\left[\alpha + \beta D_i + \sum_j \gamma_j X_{ji} + \varepsilon_i\right] . \tag{3}$$

Then, using the same notation as above, after setting Di = 0 and Di = 1 in (3), and setting εi to its mean value of zero, we immediately have:

g01 = exp(β) - 1

and

g10 = exp(-β) - 1 .

The "i" subscript on the left-hand side of these last two expressions has been suppressed, as these rates are independent of the sample values. In addition, they depend only on the value of the coefficient of the dummy variable, and not on the other coefficients in the model. That's very convenient, but this alone shouldn't be used to justify using this log-linear model. (See my earlier posts, here and here.)

Also, as we'll see below, this is a rather unusual situation, and we should be careful not to presume that this applies when we use other non-linear transformations of the model's dependent variable.
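A short Python check of this result may be helpful. It is only a sketch: the parameter values and regressor values are hypothetical, and the error term is set to zero as in the derivation above.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
alpha, beta = 1.0, 0.25
gamma = [0.4, -0.2]


def expected_y(d, x):
    """Equation (3) with the error term set to zero."""
    return math.exp(alpha + beta * d + sum(g * xj for g, xj in zip(gamma, x)))


def log_growth_rates(x):
    """Return (g01, g10) computed directly from the levels of y."""
    y0, y1 = expected_y(0, x), expected_y(1, x)
    return y1 / y0 - 1.0, y0 / y1 - 1.0


# The rates are the same whatever regressor values we plug in ...
print(log_growth_rates([1.0, 2.0]))
print(log_growth_rates([10.0, -3.0]))
# ... and they match exp(beta) - 1 and exp(-beta) - 1.
print(math.exp(beta) - 1.0, math.exp(-beta) - 1.0)
```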

2.  Box-Cox Transformation

Next, suppose we have a model in which the basic Box and Cox (1964) transformation has been applied to the dependent variable:

$$y_i^{*} = \frac{y_i^{\lambda} - 1}{\lambda} \,; \quad \text{if } \lambda \neq 0$$

$$y_i^{*} = \log(y_i) \,; \quad \text{if } \lambda = 0 .$$

Clearly, the case where λ = 0 has been dealt with above, so let's focus on λ ≠ 0.

The regression model that we'll consider is:

$$y_i^{*} = \alpha + \beta D_i + \sum_j \gamma_j X_{ji} + \varepsilon_i \,; \quad E(\varepsilon_i) = 0 \,; \quad i = 1, \ldots, n. \tag{4}$$

This model is non-linear in the parameter λ, and the interpretation of the impact of the dummy variable on y is not straightforward.

Note that we can re-write (4) as:

$$y_i = \left[1 + \lambda\left(\alpha + \beta D_i + \sum_j \gamma_j X_{ji} + \varepsilon_i\right)\right]^{1/\lambda} . \tag{5}$$

Once again, using the same notation as above, after assigning Di = 0 and Di = 1 in (5), and setting the error term to (its mean value of) zero, we immediately have:

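$$g_{01i} = \left\{\frac{1 + \lambda(\alpha + \beta + \sum_j \gamma_j X_{ji})}{1 + \lambda(\alpha + \sum_j \gamma_j X_{ji})}\right\}^{1/\lambda} - 1$$

and

$$g_{10i} = \left\{\frac{1 + \lambda(\alpha + \sum_j \gamma_j X_{ji})}{1 + \lambda(\alpha + \beta + \sum_j \gamma_j X_{ji})}\right\}^{1/\lambda} - 1 \,; \quad i = 1, 2, \ldots, n. \tag{6}$$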

As in the case of the fully linear model, these rates of growth associated with the dummy variable are both data-dependent, and they depend on all of the model's parameters.
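Here is a minimal Python sketch of these Box-Cox growth rates. The function names and the numerical values are hypothetical, and the code assumes that 1 + λ(α + β Di + Σj γj Xji) is positive, so that the power in (5) is well defined.

```python
import math


def boxcox_inverse(z, lam):
    """Invert y* = (y**lam - 1) / lam for lam != 0, as in (5) with zero error."""
    return (1.0 + lam * z) ** (1.0 / lam)


def boxcox_growth_rates(alpha, beta, gamma, x, lam):
    """Return (g01, g10) for one observation with regressor values x."""
    index = alpha + sum(g * xj for g, xj in zip(gamma, x))
    y0 = boxcox_inverse(index, lam)          # D = 0
    y1 = boxcox_inverse(index + beta, lam)   # D = 1
    return y1 / y0 - 1.0, y0 / y1 - 1.0


# Hypothetical values, for illustration only.
print(boxcox_growth_rates(2.0, 0.5, [0.3, -0.1], [1.0, 2.0], lam=0.5))
print(boxcox_growth_rates(2.0, 0.5, [0.3, -0.1], [5.0, 1.0], lam=0.5))
# With lam very close to zero, the rates approach the log-model values.
print(boxcox_growth_rates(2.0, 0.5, [0.3, -0.1], [1.0, 2.0], lam=1e-6))
print(math.exp(0.5) - 1.0, math.exp(-0.5) - 1.0)
```

The last two lines illustrate that the log-linear results of section 1 are recovered as the limiting case when λ approaches zero.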

3.  Box-Cox Transformation With a Location Shift

A major disadvantage of the basic Box-Cox transformation is that it can't be applied to data that take negative values. One way of adapting the Box-Cox transformation to take this into account is to introduce a second parameter - a location parameter that effectively allows us to "shift" the origin of the data.

With this modification, we have:

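(i) $$y_i^{*} = \frac{(y_i + \lambda_2)^{\lambda_1} - 1}{\lambda_1} \,; \quad \text{if } \lambda_1 \neq 0$$

(ii) $$y_i^{*} = \log(y_i + \lambda_2) \,; \quad \text{if } \lambda_1 = 0, \text{ and } y_i > -\lambda_2 .$$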

Again, our regression model is:

$$y_i^{*} = \alpha + \beta D_i + \sum_j \gamma_j X_{ji} + \varepsilon_i \,; \quad E(\varepsilon_i) = 0 \,; \quad i = 1, \ldots, n. \tag{7}$$

The model is non-linear in the two parameters λ1 and λ2, and again the interpretation of the impact of the dummy variable on the expected value of y isn't trivial.

Substituting for yi* in (7), setting the error to its mean value of zero, and solving for yi, we can re-write the model as:

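(i) $$y_i = \left[1 + \lambda_1\left(\alpha + \beta D_i + \sum_j \gamma_j X_{ji}\right)\right]^{1/\lambda_1} - \lambda_2 \,; \quad \text{if } \lambda_1 \neq 0$$

(ii) $$y_i = \exp\left[\alpha + \beta D_i + \sum_j \gamma_j X_{ji}\right] - \lambda_2 \,; \quad \text{if } \lambda_1 = 0, \text{ and } y_i > -\lambda_2 . \tag{8}$$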

The expressions for the growth rates associated with "switching" the dummy variable's values are:

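(i) If λ1 ≠ 0:

$$g_{01i} = \frac{\left[1 + \lambda_1(\alpha + \beta + \sum_j \gamma_j X_{ji})\right]^{1/\lambda_1} - \lambda_2}{\left[1 + \lambda_1(\alpha + \sum_j \gamma_j X_{ji})\right]^{1/\lambda_1} - \lambda_2} - 1$$

$$g_{10i} = \frac{\left[1 + \lambda_1(\alpha + \sum_j \gamma_j X_{ji})\right]^{1/\lambda_1} - \lambda_2}{\left[1 + \lambda_1(\alpha + \beta + \sum_j \gamma_j X_{ji})\right]^{1/\lambda_1} - \lambda_2} - 1$$

(ii) If λ1 = 0, and yi > -λ2:

$$g_{01i} = \frac{\exp(\alpha + \beta + \sum_j \gamma_j X_{ji}) - \lambda_2}{\exp(\alpha + \sum_j \gamma_j X_{ji}) - \lambda_2} - 1$$

$$g_{10i} = \frac{\exp(\alpha + \sum_j \gamma_j X_{ji}) - \lambda_2}{\exp(\alpha + \beta + \sum_j \gamma_j X_{ji}) - \lambda_2} - 1 \,; \quad i = 1, 2, \ldots, n. \tag{9}$$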

Of course, if we set λ2 = 0 in these growth rate expressions, we get the growth rates associated with the regular Box-Cox model, and the semilogarithmic model, given in sections 2 and 1 above.

Note that this modified Box-Cox transformation comes with a "cost". There are now two additional parameters that have to be estimated, along with the regression coefficients.
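The nesting of the earlier cases can be checked numerically. The following Python sketch inverts the two-parameter transformation as in (8) and computes the growth rates directly from the implied levels of y. The function names and the numerical values are hypothetical.

```python
import math


def shifted_boxcox_inverse(z, lam1, lam2):
    """Solve the two-parameter Box-Cox transformation for y, given y* = z."""
    if lam1 == 0:
        return math.exp(z) - lam2
    return (1.0 + lam1 * z) ** (1.0 / lam1) - lam2


def shifted_boxcox_growth_rates(index, beta, lam1, lam2):
    """index stands for alpha + sum_j gamma_j x_j for one observation."""
    y0 = shifted_boxcox_inverse(index, lam1, lam2)          # D = 0
    y1 = shifted_boxcox_inverse(index + beta, lam1, lam2)   # D = 1
    return y1 / y0 - 1.0, y0 / y1 - 1.0


index, beta = 2.1, 0.5   # hypothetical values
print(shifted_boxcox_growth_rates(index, beta, lam1=0.5, lam2=1.0))
print(shifted_boxcox_growth_rates(index, beta, lam1=0.5, lam2=0.0))  # regular Box-Cox
print(shifted_boxcox_growth_rates(index, beta, lam1=0.0, lam2=0.0))  # semilogarithmic
```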

4.  Inverse Hyperbolic Sine Transformation

A more general transformation that has been suggested for dealing with data which may be negative, as well as positive, is the inverse hyperbolic sine function.

If this transformation is applied to the yi data, we have

$$y_i^{*} = \sinh^{-1}(y_i) = \log\left[y_i + \sqrt{1 + y_i^{2}}\right] . \tag{10}$$

As before, our regression model is:

$$y_i^{*} = \alpha + \beta D_i + \sum_j \gamma_j X_{ji} + \varepsilon_i \,; \quad E(\varepsilon_i) = 0 \,; \quad i = 1, \ldots, n. \tag{11}$$

Notice that in this case we don't have to add additional parameters to the model's specification, which is advantageous relative to the situation with the modified Box-Cox transformation.

Substituting (11) in (10), setting the error term to zero, and solving for yi, we get:

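$$y_i = 0.5 \left\{\exp\left(\alpha + \beta D_i + \sum_j \gamma_j X_{ji}\right) - \exp\left[-\left(\alpha + \beta D_i + \sum_j \gamma_j X_{ji}\right)\right]\right\} . \tag{12}$$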

Then, if we set Di = 0 and Di = 1 in (12), and compute the implied growth rates in (the expected value of) yi, the results that we get are:

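$$g_{01i} = \frac{\exp(\alpha + \sum_j \gamma_j X_{ji})\left[\exp(\beta) - 1\right] - \exp\left[-(\alpha + \sum_j \gamma_j X_{ji})\right]\left[\exp(-\beta) - 1\right]}{B_{01i}}$$

and

$$g_{10i} = \frac{\exp(\alpha + \sum_j \gamma_j X_{ji})\left[1 - \exp(\beta)\right] - \exp\left[-(\alpha + \sum_j \gamma_j X_{ji})\right]\left[1 - \exp(-\beta)\right]}{B_{10i}}$$

where:

$$B_{01i} = \exp\left(\alpha + \sum_j \gamma_j X_{ji}\right) - \exp\left[-\left(\alpha + \sum_j \gamma_j X_{ji}\right)\right]$$

$$B_{10i} = \exp\left(\alpha + \beta + \sum_j \gamma_j X_{ji}\right) - \exp\left[-\left(\alpha + \beta + \sum_j \gamma_j X_{ji}\right)\right] \,; \quad i = 1, 2, \ldots, n. \tag{13}$$
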
Yet again, the growth rates depend on the values of all of the model's parameters, as well as on the observed X values.
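As a numerical check on (13), the Python sketch below compares the formula with the growth rate obtained directly from (12), which is just a ratio of hyperbolic sines. The function name and the values used for β and for α + Σj γj Xji are hypothetical.

```python
import math


def ihs_growth_rates(index, beta):
    """Growth rates in (13); index stands for alpha + sum_j gamma_j x_j."""
    b01 = math.exp(index) - math.exp(-index)
    b10 = math.exp(index + beta) - math.exp(-(index + beta))
    g01 = (math.exp(index) * (math.exp(beta) - 1.0)
           - math.exp(-index) * (math.exp(-beta) - 1.0)) / b01
    g10 = (math.exp(index) * (1.0 - math.exp(beta))
           - math.exp(-index) * (1.0 - math.exp(-beta))) / b10
    return g01, g10


beta = 0.25   # hypothetical value
for index in (0.1, 1.0, 5.0):
    g01, _ = ihs_growth_rates(index, beta)
    direct = math.sinh(index + beta) / math.sinh(index) - 1.0
    # Columns: index, formula (13), direct calculation, log-model rate.
    print(index, g01, direct, math.exp(beta) - 1.0)
```

In this sketch the rate is close to the log-model value exp(β) - 1 only when the index is large, and it is very different when the index is near zero.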

5. Yeo-Johnson Transformation

As a final example, let's look at the Yeo and Johnson (2000) power transformation of the dependent variable. This takes the following form:

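(i) $$y_i^{*} = \frac{(y_i + 1)^{\lambda} - 1}{\lambda} \,; \quad \text{if } \lambda \neq 0 \text{ and } y_i \geq 0$$

(ii) $$y_i^{*} = \log(y_i + 1) \,; \quad \text{if } \lambda = 0 \text{ and } y_i \geq 0$$

(iii) $$y_i^{*} = - \frac{(1 - y_i)^{(2 - \lambda)} - 1}{2 - \lambda} \,; \quad \text{if } \lambda \neq 2 \text{ and } y_i < 0$$

(iv) $$y_i^{*} = - \log(1 - y_i) \,; \quad \text{if } \lambda = 2 \text{ and } y_i < 0$$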

(You can see that the first two cases for this transformation relate to the modified Box-Cox and Box-Cox transformations, respectively.)

Writing our regression model as in (11), but with this new definition of yi*, we can proceed in the same manner as above:
• We substitute each of the expressions for yi*, in turn, into (11);
• We set the error term to zero, and solve for yi itself;
• We evaluate the resulting expressions when the dummy variable takes its two possible values;
• We compute the two growth rate expressions.
The resulting growth rate expressions associated with the "switching" of the dummy variable  between values of "0" and "1", or vice versa, are (for i = 1, 2, ...., n):

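(i) If λ ≠ 0 and yi ≥ 0:

$$g_{01i} = \frac{\left[1 + \lambda(\alpha + \beta + \sum_j \gamma_j X_{ji})\right]^{1/\lambda} - 1}{\left[1 + \lambda(\alpha + \sum_j \gamma_j X_{ji})\right]^{1/\lambda} - 1} - 1$$

$$g_{10i} = \frac{\left[1 + \lambda(\alpha + \sum_j \gamma_j X_{ji})\right]^{1/\lambda} - 1}{\left[1 + \lambda(\alpha + \beta + \sum_j \gamma_j X_{ji})\right]^{1/\lambda} - 1} - 1$$

(ii) If λ = 0 and yi ≥ 0:

$$g_{01i} = \frac{\exp(\alpha + \beta + \sum_j \gamma_j X_{ji}) - 1}{\exp(\alpha + \sum_j \gamma_j X_{ji}) - 1} - 1$$

$$g_{10i} = \frac{\exp(\alpha + \sum_j \gamma_j X_{ji}) - 1}{\exp(\alpha + \beta + \sum_j \gamma_j X_{ji}) - 1} - 1$$

(iii) If λ ≠ 2 and yi < 0:

$$g_{01i} = \frac{1 - \left[1 - (2 - \lambda)(\alpha + \beta + \sum_j \gamma_j X_{ji})\right]^{1/(2 - \lambda)}}{1 - \left[1 - (2 - \lambda)(\alpha + \sum_j \gamma_j X_{ji})\right]^{1/(2 - \lambda)}} - 1$$

$$g_{10i} = \frac{1 - \left[1 - (2 - \lambda)(\alpha + \sum_j \gamma_j X_{ji})\right]^{1/(2 - \lambda)}}{1 - \left[1 - (2 - \lambda)(\alpha + \beta + \sum_j \gamma_j X_{ji})\right]^{1/(2 - \lambda)}} - 1$$

(iv) If λ = 2 and yi < 0:

$$g_{01i} = \frac{1 - \exp\left[-(\alpha + \beta + \sum_j \gamma_j X_{ji})\right]}{1 - \exp\left[-(\alpha + \sum_j \gamma_j X_{ji})\right]} - 1$$

$$g_{10i} = \frac{1 - \exp\left[-(\alpha + \sum_j \gamma_j X_{ji})\right]}{1 - \exp\left[-(\alpha + \beta + \sum_j \gamma_j X_{ji})\right]} - 1 \tag{14}$$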

Of course, all of these growth rates depend, yet again, on the values of all of the parameters in the model, and vary according to the sample values taken by the regressors.
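The four steps listed above can be sketched in Python as follows. This is an illustration only: the function names and numerical values are hypothetical, and the code picks the branch of the inverse from the sign of yi*, relying on the fact that the transformation maps non-negative y to non-negative y*, and negative y to negative y*. A growth rate is undefined if the implied level of y at the starting value of the dummy is zero.

```python
import math


def yeo_johnson_inverse(z, lam):
    """Solve y* = z for y; z >= 0 corresponds to y >= 0, and z < 0 to y < 0."""
    if z >= 0:
        if lam == 0:
            return math.exp(z) - 1.0                       # case (ii)
        return (1.0 + lam * z) ** (1.0 / lam) - 1.0        # case (i)
    if lam == 2:
        return 1.0 - math.exp(-z)                          # case (iv)
    return 1.0 - (1.0 - (2.0 - lam) * z) ** (1.0 / (2.0 - lam))  # case (iii)


def yj_growth_rates(index, beta, lam):
    """index stands for alpha + sum_j gamma_j x_j for one observation."""
    y0 = yeo_johnson_inverse(index, lam)          # D = 0
    y1 = yeo_johnson_inverse(index + beta, lam)   # D = 1
    return y1 / y0 - 1.0, y0 / y1 - 1.0


# Hypothetical values: one observation on each side of zero.
print(yj_growth_rates(2.0, 0.5, lam=0.5))
print(yj_growth_rates(-2.0, 0.5, lam=0.5))
```

Working from the implied levels of y, rather than from one branch of (14), also covers the situation where switching the dummy moves y from one side of zero to the other.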

The Naive Level-Break Model

As a bit of an aside, let's consider the case where the model includes an intercept and a dummy variable, but no other regressors. In other words, the dependent variable has just a "breaking (mean) level":

$$y_i^{*} = \alpha + \beta D_i + \varepsilon_i \,; \quad E(\varepsilon_i) = 0 \,; \quad i = 1, \ldots, n. \tag{15}$$

You might be thinking, "this isn't a very interesting/realistic model", and it's not! But bear with me - you'll see in a moment why I want to talk about this rather special situation.

Note the following:
• If  yi* =  yi , so that the model is fully linear, β itself has the same interpretation as before, and (unsurprisingly) the growth rates no longer depend on the sample values. Specifically, we then have  g01 = [β / α] , and   g10 = - [β / (α + β)].
• If  yi* =  log(yi), the expressions for the two growth rates remain the same as those given above for the case where other regressors enter the model. This is a very special result!
• In all of the other transformations that we've considered, the growth rates simplify in obvious ways. They're no longer the same as the ones we derived earlier.
Here's why this matters.

For all of these other transformations (2 to 5, above), if you derive the growth rates using just the level-break model, (15), you obtain formulae that are wrong, if what you're really interested in is a model that includes other regressors.

Ouch!

As a case in point, an unpublished paper by Lachowska (2017) falls into precisely this trap in the context of the inverse hyperbolic sine transformation! A recent paper by Bellemare and Wichman (2018), which is discussed in one of Marc Bellemare's blog posts, uses Lachowska's incorrect result in one part of their own discussion. An unwary reader might easily infer too much from that discussion.
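To see how much this can matter, the Python sketch below computes g01 twice for the log and inverse hyperbolic sine transformations: once from the level-break model, (15), and once from a model with one additional regressor. All of the numerical values are hypothetical, and the helper function is my own.

```python
import math

# Hypothetical parameter and regressor values, for illustration only.
alpha, beta = 0.2, 0.25
gamma, x = [0.5], [3.0]


def g01(inverse, index):
    """Growth rate in y when D switches from 0 to 1, given the inverse transform."""
    return inverse(index + beta) / inverse(index) - 1.0


full_index = alpha + sum(g * xj for g, xj in zip(gamma, x))
for name, inverse in (("log", math.exp), ("IHS", math.sinh)):
    level_break = g01(inverse, alpha)       # derived from (15): no other regressors
    full_model = g01(inverse, full_index)   # derived with the regressor included
    print(name, round(level_break, 4), round(full_model, 4))
```

For the log transformation the two numbers coincide, but for the inverse hyperbolic sine they can be far apart.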

Some Estimation Issues

All of the growth rates derived above are expressed in terms of the true, unknown, values of certain parameters (and in many cases in terms of the sample values of the non-dummy regressors). Estimating the implied growth rates involves inserting estimates of these parameters into the various growth rate formulae.

This raises a number of issues. While these issues aren't my primary concern here, some comments are certainly in order.

There's an established literature concerning the properties of various estimators of the growth rates implied by the dummy variable in the semilogarithmic model. For instance, see Kennedy (1981), Giles (1982), and my recent blog post, here. Keep in mind that these results require an assumption that the model's error term is normally distributed.
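As a small illustration for the semilogarithmic case, the Python sketch below compares the simple plug-in estimator, exp(b) - 1, with the adjustment suggested by Kennedy (1981), which subtracts half of the estimated variance of b inside the exponential. The data are simulated with hypothetical parameter values and normal errors, and the model contains only an intercept and the dummy variable, which is harmless here because the growth rate in the log model doesn't depend on the other regressors.

```python
import math
import random
import statistics

random.seed(0)
alpha, beta, sigma = 1.0, 0.3, 0.5   # hypothetical "true" values
n0 = n1 = 20
log_y0 = [alpha + random.gauss(0.0, sigma) for _ in range(n0)]          # D = 0
log_y1 = [alpha + beta + random.gauss(0.0, sigma) for _ in range(n1)]   # D = 1

# OLS of log(y) on an intercept and D reduces to a difference in group means.
m0, m1 = statistics.fmean(log_y0), statistics.fmean(log_y1)
b = m1 - m0
rss = sum((v - m0) ** 2 for v in log_y0) + sum((v - m1) ** 2 for v in log_y1)
s2 = rss / (n0 + n1 - 2)             # unbiased estimate of the error variance
var_b = s2 * (1.0 / n0 + 1.0 / n1)   # estimated variance of b

naive = math.exp(b) - 1.0                    # plug-in estimator of g01
kennedy = math.exp(b - 0.5 * var_b) - 1.0    # Kennedy's (1981) adjustment
print(naive, kennedy, math.exp(beta) - 1.0)
```

The adjusted estimate is always below the plug-in estimate, because the estimated variance is positive. How the two compare with the true rate in any one sample depends on the draw.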

The papers by Burbidge et al. (1988) and MacKinnon and Magee (1990) provide some insights into various aspects of inference in general in the context of the inverse hyperbolic sine transformation. Lachowska's (2017) results regarding dummy variable growth rates after this transformation are correct for the (uninteresting) level-break model, (15), but incorrect for the full regression model, (11).

Take-Aways

There are several take-away points from this post, including:
• We must be very careful when interpreting the impact/role of a dummy variable in a regression model where the dependent variable has been transformed in some non-linear way.
• The correct interpretation depends crucially on the specific transformation that's been used.
• Often, it's helpful to express this interpretation in terms of the growth rate (or percentage change) implied for the (mean of the) dependent variable when the dummy variable "switches" between its values of zero and one.
• In most cases, these growth rates depend on all of the unknown parameters in the regression model, as well as on the sample-specific values of the regressors.
• Measuring these growth rates involves estimating the model's unknown parameters and assigning values to the regressors. The latter could be achieved by using sample mean values.
• Unless the parameter estimates that are "inserted" into the various growth rate formulae are chosen very judiciously, there are no guarantees with regard to the quality of the resulting estimated growth rates in small to moderate-sized samples.
I'm sure that readers would like to see some empirical applications of some of these results - these will be coming up in due course!

And don't forget - I'll be uploading a separate, brief, "Summary of Results" shortly.

References

Bellemare, M.F. and C. Wichman, 2018. Elasticities and the inverse hyperbolic sine transformation. Mimeo., Department of Applied Economics, University of Minnesota.

Box, G. E. P. and D. R. Cox, 1964. An analysis of transformations. Journal of the Royal Statistical Society, B, 26, 211–252.

Burbidge, J. B., L. Magee, and L. Robb, 1988. Alternative transformations to handle extreme values of the dependent variable. Journal of the American Statistical Association, 83, 123-127.

Giles, D. E., 1982. The interpretation of dummy variables in semilogarithmic equations: Unbiased estimation. Economics Letters, 10, 77-79.

Halvorsen, R. and R. Palmquist, 1980. The interpretation of dummy variables in semilogarithmic equations. American Economic Review, 70, 474-475.

Kennedy, P. E., 1981. Estimation with correctly interpreted dummy variables in semilogarithmic equations. American Economic Review, 71, 801.

Lachowska, M., 2017. A note on the approximate interpretation of dummy variable coefficients in inverse hyperbolic sine regressions. Mimeo., W. E. Upjohn Institute for Employment Research.

MacKinnon, J. G. and L. Magee, 1990. Transforming the dependent variable in regression models. International Economic Review, 31, 315-339.

Yeo, I-K. and R. Johnson, 2000. A new family of power transformations to improve normality or symmetry. Biometrika, 87, 954-959.

1. Marta Lachowska emailed me today as follows:
"I wanted to say that the paper you refer to on your blog (http://davegiles.blogspot.com/2018/07/interpreting-dummy-variable.html)—Lachowska (2017)—was a draft that I circulated for comments. When I learned that the approximation worked poorly in general—back in April—, I took it down from my website. I don't stand by the paper, but unfortunately it may still come up when people search for the title."

Thanks for clarifying, Marta. I should point out that the link to your paper still worked fine when I tested it on the day this post went out. It seems that you have taken it down since then.

2. To clarify, I deleted the link on my research webpage back in April. At that point, the paper no longer appeared on my research webpage. I naively believed that, because the link was gone, so was the uploaded PDF. But as Dave Giles has said to me in an email, links to files and the files themselves are two different things. Clearly, the URL with the PDF remained available in my domain for search engines to find, as I discovered when I clicked on the link in Dave's blog. At that time, I contacted Weebly to have the PDF removed.

3. To clarify, I deleted the link on my research webpage back in April. At that point, the paper no longer appeared on my research webpage. I mistakenly believed that, because the link was gone, so was the uploaded PDF. Clearly, the URL with the PDF remained available in my domain for search engines to find, as I discovered when I clicked on the link in Dave's blog. At that time, I contacted the website provider to have the PDF removed.
Marta Lachowska