Tuesday, April 14, 2015

Regression Coefficients & Units of Measurement

A linear regression equation is just that - an equation. This means that when any of the variables - dependent or explanatory - has units of measurement, we also have to keep track of the units of measurement for the estimated regression coefficients.

All too often this seems to be something that students of econometrics tend to overlook.

Consider the following regression model:

               yi = β0 + β1x1i + β2x2i + β3x3i + εi    ;    i = 1, 2, ..., n                   (1)

where y and x2 are measured in dollars; x1 is measured in Kg; and x3 is a unitless index.

Because the term on the left side of (1) has units of dollars, every term on the right side of that equation must also be expressed in terms of dollars. These terms are β0, (β1x1i), (β2x2i), (β3x3i), and εi.

In turn, this implies that β0 and β3 have units which are dollars; the units of β1 are ($ / Kg); and β2 is unitless. In addition, the error term, ε, has units that are dollars, and so does its standard deviation, σ.
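
To make this accounting concrete, here's a minimal sketch in Python (my addition, with entirely made-up data; it isn't part of the original post). It shows that re-expressing x1 in grams rather than Kg simply divides the estimated β1 by 1,000 - the units change, but the economic content doesn't:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100
    x1 = rng.uniform(1, 10, n)                         # Kg
    x2 = rng.uniform(10, 100, n)                       # dollars
    x3 = rng.uniform(0, 1, n)                          # unitless index
    y = 5 + 2.5 * x1 + 0.3 * x2 + 7 * x3 + rng.normal(0, 1, n)   # dollars

    # OLS with x1 in Kg: the slope on x1 has units ($ / Kg)
    X = np.column_stack([np.ones(n), x1, x2, x3])
    b = np.linalg.lstsq(X, y, rcond=None)[0]

    # OLS with x1 converted to grams: the slope now has units ($ / g)
    Xg = np.column_stack([np.ones(n), 1000 * x1, x2, x3])
    bg = np.linalg.lstsq(Xg, y, rcond=None)[0]

    print(b[1], 1000 * bg[1])    # identical: the $/g slope is 1/1000 of the $/Kg slope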

What are some of the implications of this?

The "standard errors" associated with each OLS estimate of the β's also have units. They're the same as for the β's themselves. (The t-ratios, of course, are always unitless.) Also, strictly speaking, when we report confidence intervals for any of the β's these units of measurement should also be reported. 

Another important implication is that we should be very careful indeed when comparing the numerical values of estimated regression coefficients, even within the same model. Suppose that the OLS point estimates of β1 and β3 in (1) are 1.0 and 3.0 respectively. Does this mean that changes in x3 have three times the impact on y, as compared with changes in x1? Certainly not!

Remember, that value of 1.0 is actually $1.0 per Kg, whereas the 3.0 value is actually $3.0. You can't compare magnitudes that are in different units unless this difference is properly taken into account.

There's also a slightly more subtle point that goes beyond this simple arithmetic.

To take the discussion a step further, suppose that we added another regressor to equation (1): x4, with units of dollars, and a coefficient of β4. Suppose, too, that the OLS estimates of β2 and β4 are 2.0 and 4.0 respectively. Both of these numbers are unitless, so we can legitimately say that the point estimate of β4 is twice as big as the point estimate of β2.

However, can we say that "the impact of x4 on y is twice as big as the impact of x2 on y"? I know that it's tempting to do so!

To answer this question, first we have to decide what it actually means!

Specifically, we have to decide what sort of "change" in the variables we're talking about when we use the term "impact".

It's true that a one-unit (dollar) change in x4 leads to a change in the dollar value of y that is twice the size of the dollar change in y that occurs when x2 changes by one unit. However, consider the following point.

Suppose that the sample size is n = 6, and that the sample values for x2 and x4 are x2: {$1, $2, $3, $4, $5, $6.45}; and x4: {$0.01, $0.02, $0.03, $0.04, $0.05, $0.0645}. The sample averages and standard deviations are 3.575 and 1.9959 for x2, and 0.03575 and 0.019959 for x4. So, a one-unit change in x2 is a relatively modest change, in the sense that it's equivalent to approximately half a standard deviation. On the other hand, a one-unit change in x4 is quite substantial, in the sense that it's a change of roughly 50 standard deviations!

If the y variable has a sample standard deviation of 1.0, then the interpretation of the OLS estimates (2.0 and 4.0) of β2 and β4 is as follows. Ceteris paribus, a change of half a standard deviation in x2 will lead to a 2 standard deviation change in y; while a change of 50 standard deviations in x4 will lead to a 4 standard deviation change in y.
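
The arithmetic here is easy to check. A quick Python sketch, using the six artificial sample values given above:

    import numpy as np

    x2 = np.array([1, 2, 3, 4, 5, 6.45])    # dollars
    x4 = x2 / 100                            # also dollars, on a much smaller scale

    # Sample means and standard deviations (ddof=1 gives the sample s.d.)
    print(x2.mean(), x2.std(ddof=1))         # 3.575    1.9959...
    print(x4.mean(), x4.std(ddof=1))         # 0.03575  0.019959...

    # A one-unit ($1) change, expressed in standard deviations of each regressor
    print(1 / x2.std(ddof=1))                # about 0.5 standard deviations
    print(1 / x4.std(ddof=1))                # about 50 standard deviations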

Now which coefficient estimate do you think is the "larger" - that of β2 or that of β4?

This suggests an alternative way of thinking about the "impacts" of x2 and x4 on y. We might measure these impacts in terms of changes in the variables after they have been scaled to take into account the different sample variations in the data.

Let's illustrate this by estimating some OLS regression models using EViews 9. The workfile and the (totally artificial) data are available on the code and data pages for this post, respectively.

Here are my OLS results, using the "raw" data:

[EViews output: OLS estimates from the raw data]

Then, if I select the "View" tab and choose "Coefficient Diagnostics" and then "Scaled Coefficients" from the drop-down menus, I get:

[EViews output: scaled coefficients table, including the "Standardized Coefficients"]

According to the EViews manual, the "Standardized Coefficients" in the table above are ".... the point estimates of the coefficients standardized by multiplying by the standard deviation of the regressor divided by the standard deviation of the dependent variable."

That's absolutely correct, but an alternative way of describing them is as follows. Divide y by sy, x1 by sx1, and x2 by sx2, where sy is the sample standard deviation of y, and sxj is the sample standard deviation of xj (j = 1, 2). When we then estimate this modified model by OLS, the resulting regression coefficients are the "Standardized Coefficients". Let's verify this using EViews:

[EViews output: OLS regression of y/sy on x1/sx1 and x2/sx2]

The estimated coefficients for the (non-constant) regressors are identical to the corresponding "standardized coefficients" that we saw above.
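
Readers without EViews can check the same equivalence in, say, Python with statsmodels (a sketch only - the data below are artificial, not the workfile from this post):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(123)
    n = 50
    x1 = rng.normal(10, 3, n)
    x2 = rng.normal(50, 12, n)
    y = 4 + 1.5 * x1 + 0.8 * x2 + rng.normal(0, 2, n)
    sy, s1, s2 = y.std(ddof=1), x1.std(ddof=1), x2.std(ddof=1)

    # OLS on the raw data, then standardize: beta_j = b_j * (s_xj / s_y)
    b = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params
    print(b[1] * s1 / sy, b[2] * s2 / sy)

    # OLS after dividing y by sy and each xj by sxj reproduces those numbers
    bs = sm.OLS(y / sy, sm.add_constant(np.column_stack([x1 / s1, x2 / s2]))).fit().params
    print(bs[1], bs[2])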

Equivalently, we can estimate the model by OLS after scaling the regressors x1 and x2 by multiplying them by (sy / sx1) and (sy / sx2), respectively. In addition, we'll multiply the intercept "variable" (the series of "ones") by sy. Then we'll estimate the regression model, with y itself as the dependent variable:

[EViews output: OLS regression of y on the rescaled intercept and regressors]

Once again, the estimated coefficients of x1 and x2 match the earlier standardized coefficients.
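
That scaling is just as easy to check outside EViews. Here it is in the same sketch form (self-contained, with the same artificial data regenerated):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(123)
    n = 50
    x1 = rng.normal(10, 3, n)
    x2 = rng.normal(50, 12, n)
    y = 4 + 1.5 * x1 + 0.8 * x2 + rng.normal(0, 2, n)
    sy, s1, s2 = y.std(ddof=1), x1.std(ddof=1), x2.std(ddof=1)

    # Intercept column scaled by sy; each regressor multiplied by (sy / sxj)
    X = np.column_stack([sy * np.ones(n), x1 * sy / s1, x2 * sy / s2])
    b = sm.OLS(y, X).fit().params
    print(b[1], b[2])    # the standardized coefficients once more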

Finally, we can view the standardized coefficients as what we get if we literally standardize every variable in the regression model by subtracting the sample mean and dividing by the corresponding sample standard deviation. (The intercept vanishes from the model once we subtract the mean of the column of "ones".) In EViews, with "ybar" denoting the sample average of the y variable, etc., we get:

[EViews output: OLS regression of (y - ybar)/sy on the correspondingly standardized regressors]

Personally, it's this last version of the model that I prefer to think of as the basis for the so-called standardized coefficients. In addition, Andrew Gelman (2008) points out that subtracting the sample means to centre the data at zero makes interpretation of (any) interaction effects easier.
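
For completeness, here's this last, fully standardized, version in the same artificial-data Python sketch:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(123)
    n = 50
    x1 = rng.normal(10, 3, n)
    x2 = rng.normal(50, 12, n)
    y = 4 + 1.5 * x1 + 0.8 * x2 + rng.normal(0, 2, n)

    def z(v):
        # Standardize: subtract the sample mean, divide by the sample s.d.
        return (v - v.mean()) / v.std(ddof=1)

    # No intercept: the column of "ones" vanishes once it's demeaned
    b = sm.OLS(z(y), np.column_stack([z(x1), z(x2)])).fit().params
    print(b)             # the standardized ("beta") coefficients yet again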

Beta coefficients were discussed briefly in Art Goldberger's Econometric Theory (1964, pp. 197-198). He observed that, "Although they are extensively used in psychological statistics, standardized variables and beta coefficients are rarely used in econometrics."

To this, I'd add:
  • They're also used in other disciplines in the social sciences.
  • Other econometrics packages (such as Stata) report beta coefficients.
  • This topic doesn't seem to get much attention (if any) in more recent econometrics text books.
  • Similar results apply if the regression model is nonlinear in the parameters.
My bottom line - I don't actually use beta coefficients at all. I prefer to think in terms of marginal effects measured in the original units of the data.

© 2015, David E. Giles

4 comments:

  1. Very interesting. But how do you interpret the estimated coefficients when the data are I(1), in which case subtracting the mean is nonsense?

     Reply: You're subtracting the sample mean of the data, not the population mean.

  2. Dear Dave! How can we forecast annual values for a future period in EViews 9? Please write a complete blog post on forecasting techniques in EViews, giving examples and showing EViews workfiles.

  3. Very clearly explained. Thank you.

