There are all sorts of good reasons why we sometimes transform the dependent variable (y) in a regression model before we start estimating. One example would be where we want to be able to reasonably assume that the model's error term is normally distributed. (This may be helpful for subsequent finite-sample inference.)
If the model has non-random regressors, and the error term is additive, then a normal error term implies that the dependent variable is also normally distributed. But it may be quite plain to us (even from simple visual observation) that the sample of data for the y variable really can't have been drawn from a normally distributed population. In that case, a functional transformation of y may be in order.
So, suppose that we estimate a model of the form
$$f(y_i) = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} + \varepsilon_i \; ; \quad \varepsilon_i \sim \text{iid } N[0, \sigma^2] \, . \qquad (1)$$
where f(.) is usually a one-to-one function, so that f⁻¹(.) is uniquely defined. Examples include f(y) = log(y) (where, throughout this post, log(a) will mean the natural logarithm of 'a'), and f(y) = √y (if we restrict ourselves to the positive square root).
Having estimated the model, we may then want to generate forecasts of y itself, not of f(y). This is where the inverse transformation, f⁻¹(.), comes into play.
In a recent post, and one quite some time back, I addressed this forecasting "re-transformation" issue for the two examples of f(y) given above.
Of course, this is quite a general problem, and there's a substantial statistics literature associated with it. I plan to talk about that literature in an upcoming blog post.
For now, I want to extend my earlier discussion to another common transformation that can be used even when the sample of y data includes values that are zero or negative. I'm referring to the Inverse Hyperbolic Sine (IHS) function, which is defined as w = sinh⁻¹(y) = log[y + √(1 + y²)].
The IHS function is monotonically increasing, and its (unique) inverse is the Hyperbolic Sine function itself: y = f⁻¹(w) = sinh(w) = 0.5[exp(w) - exp(-w)]. Here's what these two functions look like:

[Figure: the IHS function, sinh⁻¹(y), and its inverse, the hyperbolic sine function, sinh(w).]
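If you want to play with these two functions yourself, a few lines of Python will do it. NumPy's built-in arcsinh() and sinh() implement exactly the IHS transformation and its inverse; the values below are made up, purely for illustration:

```python
import numpy as np

# The IHS transformation, w = arcsinh(y) = log(y + sqrt(1 + y^2)), is defined
# for every real y -- including zeros and negative values.
y = np.array([-100.0, -1.0, 0.0, 0.5, 10.0, 1e6])

w = np.arcsinh(y)                      # the IHS transformation
y_back = np.sinh(w)                    # its inverse: 0.5 * (exp(w) - exp(-w))

print(np.allclose(y, y_back))          # True -- sinh() exactly undoes arcsinh()

# For large |y| the IHS behaves like a (signed) logarithm: arcsinh(y) ~ log(2y)
print(np.arcsinh(1e6), np.log(2.0e6))  # essentially identical
```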
I discussed the IHS transformation (among others) in a different context in an earlier post titled "Interpreting Dummy Variable Coefficients After Non-Linear Transformations". Recently, Marc Bellemare has also posted about his work relating to the IHS function on his blog.
The use of the IHS transformation in the regression context was considered as far back as Johnson (1949). His S_U family of transformations generalizes the IHS function to include both scale and location parameters. In the econometrics literature, the IHS transformation of y has been discussed by Burbidge et al. (1988) and by MacKinnon and Magee (1990), among others.
Now, let's turn to the main result that I want to consider here. This result is implicit in some of the material in the Appendix of MacKinnon and Magee's paper, but their presentation is less straightforward. So, here we go .........
Let's go back to the model in (1), where the dependent variable is subject to an IHS transformation:
$$w_i = \sinh^{-1}(y_i) = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} + \varepsilon_i \; ; \quad \varepsilon_i \sim \text{iid } N[0, \sigma^2] \, . \qquad (2)$$
Looking at the expressions for the Hyperbolic Sine and IHS functions, you'd be forgiven for expecting that figuring out the appropriate way to forecast y itself is going to be pretty messy. In fact, that's not the case, thanks to the relationship between the normal and log-normal distributions.
From (2), we see immediately that
$$y_i = \sinh(w_i) = \sinh(\beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} + \varepsilon_i) = 0.5\,[\exp(w_i) - \exp(-w_i)] \, .$$
Now, notice that wᵢ is a Normal random variable, with a mean of μᵢ = β₁ + β₂xᵢ₂ + β₃xᵢ₃ + .... + βₖxᵢₖ, and a variance of σ².
So, exp(wᵢ) follows a log-Normal distribution, with a mean of exp(μᵢ + σ²/2). Similarly, exp(-wᵢ) follows a log-Normal distribution, with a mean of exp(-μᵢ + σ²/2), because -wᵢ is Normal with a mean of -μᵢ and the same variance, σ².
Immediately,

$$E(y_i) = 0.5\,[\exp(\mu_i + \sigma^2/2) - \exp(-\mu_i + \sigma^2/2)] = \sinh(\mu_i)\,\exp(\sigma^2/2) \, .$$
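If you'd like to convince yourself of this result numerically, here's a quick Monte Carlo sketch in Python (the values of μ and σ are made up, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(123)

mu, sigma = 1.5, 0.8                        # made-up illustrative values
eps = rng.normal(0.0, sigma, size=2_000_000)

simulated = np.sinh(mu + eps).mean()        # Monte Carlo estimate of E[sinh(mu + eps)]
analytical = np.sinh(mu) * np.exp(sigma**2 / 2)

print(simulated, analytical)                # the two agree closely
```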
Using the same reasoning as was applied in my two earlier related posts (linked above), we see right away that if we want to forecast the y variable itself, it won't be sufficient to simply use sinh(wᵢ*), where the "*" denotes a "fitted" value. That is, wᵢ* = b₁ + b₂xᵢ₂ + b₃xᵢ₃ + .... + bₖxᵢₖ, where the b's are the (OLS, say) estimates of the β's.
Having formed the "crude" prediction, sinh(w*i), we then need to multiply the result by exp(s2 /2), where s2 is the usual estimator of σ2. That is, s2 is the sum of the squared residuals that we obtain when we estimate (2), divided by (n - k).
Interestingly, this is the same "adjustment" to the crude forecast that is needed in the case of the logarithmic transformation. At least that makes it easy to remember!
Of course, as in the logarithmic case, failure to apply the adjustment will result in forecasts that are distorted downwards. (Recall that the exponential of any positive number exceeds one in value.)
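To see the adjustment in action, here's a small simulation sketch in Python (NumPy only; the data, parameter values, and variable names are mine and are purely illustrative) comparing the "crude" forecast with the adjusted one:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulate data from a model of the form (2): arcsinh(y) linear in x, Normal errors.
# (The parameter values here are made up, purely for illustration.)
x = rng.uniform(-2.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])              # intercept plus one regressor
w = X @ np.array([0.5, 1.2]) + rng.normal(0.0, 0.7, size=n)
y = np.sinh(w)                                    # the "observed" dependent variable

# OLS estimation with the IHS-transformed dependent variable
w_obs = np.arcsinh(y)
b, *_ = np.linalg.lstsq(X, w_obs, rcond=None)
resid = w_obs - X @ b
s2 = resid @ resid / (n - X.shape[1])             # s^2, with the (n - k) divisor

w_fit = X @ b
crude = np.sinh(w_fit)                            # the "crude" forecast of y
adjusted = np.exp(s2 / 2) * np.sinh(w_fit)        # crude forecast scaled by exp(s^2 / 2)

print(y.mean(), crude.mean(), adjusted.mean())    # the adjusted mean tracks y's mean
```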
All of this extends almost trivially to the case where the IHS function includes a scale parameter (θ) that has to be estimated along with the other parameters in the model. This is the situation considered by Burbidge et al. (1988) and by MacKinnon and Magee (1990). In that case the IHS transformation takes the form:
$$w = \sinh^{-1}(\theta y)\,/\,\theta = \log\big[\theta y + \sqrt{1 + (\theta y)^2}\,\big]\,/\,\theta \, .$$
It's easy to show that now the appropriate forecast of y is [exp(s²/2) sinh(θ* wᵢ*) / θ*], where θ* is a suitable estimator of θ. This, and the estimates of the other parameters in the regression model, will usually be obtained by maximum likelihood estimation. As a detail, this means that (n - k) will be replaced by n in the definition of s².
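In code, the adjusted forecast with a scale parameter is just a one-liner. The sketch below treats θ and s² as given (in practice they would be the ML estimates just described), and the numerical values are made up for illustration:

```python
import numpy as np

def ihs(y, theta):
    """Scaled IHS transformation: w = arcsinh(theta * y) / theta."""
    return np.arcsinh(theta * y) / theta

def forecast_y(w_fit, s2, theta):
    """Adjusted forecast of y: exp(s2 / 2) * sinh(theta * w_fit) / theta."""
    return np.exp(s2 / 2) * np.sinh(theta * w_fit) / theta

# Made-up fitted values and parameter values, just to show the shape of the calculation
w_fit = np.array([0.2, 1.0, 2.5])
print(forecast_y(w_fit, s2=0.49, theta=0.8))
```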
Finally, notice that the normality of the regression error term played a key role in the above analysis. If we want to relax this normality requirement, then Duan's (1983) non-parametric "smearing estimator" can be used to generate forecasts. Interestingly, the smearing estimator yields a result that's identical to the one discussed in my earlier post when y is subjected to a square root transformation. However, it leads to a different result in the case of a logarithmic transformation or the IHS transformation.
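To fix ideas, here's a minimal sketch (my own illustration) of how the smearing estimator would be applied in the IHS case: for each observation, average the retransformed values sinh(wᵢ* + eⱼ) over the estimated residuals eⱼ:

```python
import numpy as np

def smearing_forecast(w_fit, residuals):
    """Duan-style smearing forecast for an IHS-transformed model:
    for each observation, average sinh(w_fit + e_j) over the residuals e_j."""
    return np.sinh(w_fit[:, None] + residuals[None, :]).mean(axis=1)

# Example usage, with w_fit and resid taken from the simulation sketch above:
# y_forecast = smearing_forecast(w_fit, resid)
```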
However, more on this in a later post.
References
Burbidge, J. B., L. Magee, & L. A. Robb, 1988. Alternative transformations to handle extreme values of the dependent variable. Journal of the American Statistical Association, 83, 123-127.
Duan, N., 1983. Smearing estimate: A nonparametric retransformation method. Journal of the American Statistical Association, 78, 605-610.
Johnson, N. L., 1949. Systems of frequency curves generated by methods of translation. Biometrika, 36, 149-176.
MacKinnon, J. G. & L. Magee, 1990. Transforming the dependent variable in regression models. International Economic Review, 31, 315-339.

© 2019, David E. Giles
"If the model has non-random regressors, and the error term is additive, then a normal error term implies that the dependent variable is also normally distributed." I suppose under these conditions, each observation of Y (Y_1, Y_2, ..., Y_n) is normally distributed, but the distribution of Y_i generally has a different expected value than Y_j for i<>j. Therefore, we cannot say that all Ys are coming from the same normal distribution, so we would not expect the sample of Ys to be normal. Does that make sense?
Reply: Nope. Any linear transformation of a normal random vector is also a normally distributed random vector. This follows immediately from an inspection of the moment generating function. That's all that I'm claiming, and all that's needed here.