Sunday, November 16, 2014

Orthogonal Regression: First Steps

When I'm introducing students in my introductory economic statistics course to the simple linear regression model, I like to point out to them that fitting the regression line so as to minimize the sum of squared residuals, in the vertical direction, is just one possibility.

They see, easily enough, that squaring the residuals deals with the positive and negative signs, and that this prevents obtaining a "visually silly" fit through the data. Mentioning that one could achieve this by working with the absolute values of the residuals provides the opportunity to mention robustness to outliers, and to link the discussion back to something they know already - the difference between the behaviours of the sample mean and the sample median, in this respect.

We also discuss the fact that measuring the residuals in the vertical ("y") direction is intuitively sensible, because the model is purporting to "explain" the y variable. Any explanatory failure should presumably be measured in this direction. However, I also note that there are other options - such as measuring the residuals in the horizontal ("x") direction.

Perhaps more importantly, I also mention "orthogonal residuals". I mention them. I don't go into any details. Frankly, there isn't time; and in any case this is usually the students' first exposure to regression analysis and they have enough to be dealing with. However, I've thought that we really should provide students with an introduction to orthogonal regression - just in the simple regression situation - once they've got basic least squares under their belts. 

The reason is that orthogonal regression comes up later on in econometrics in more complex forms, at least for some of these students; but typically they haven't seen the basics. Indeed, orthogonal regression is widely used (and misused - Carroll and Ruppert, 1996) to deal with certain errors-in-variables problems. For example, see Madansky (1959).

That got me thinking. Maybe what follows is a step towards filling this gap.

Let's focus on a simple regression model,

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \; ; \quad \varepsilon_i \sim \text{i.i.d. } N[0, \sigma^2] . \qquad (1)$$

Let $s_{xx}$, $s_{yy}$, and $s_{xy}$ be the sample variance of x, the sample variance of y, and the sample covariance of x and y, respectively. Specifically, if n is the sample size, and $x^*$ is the sample average of the $x_i$'s, then
$$s_{xx} = \left[ \sum_i (x_i - x^*)^2 \right] / (n - 1) , \quad \text{etc.}$$

We all know that the OLS estimator of $\beta_1$ is $b_1 = s_{xy} / s_{xx}$, and the associated estimator of $\beta_0$ is $b_0 = y^* - b_1 x^*$. These are also the maximum likelihood estimators of the regression coefficients if x is non-random, given the normality of the errors. So, they are "best unbiased", and also consistent and asymptotically efficient, estimators.
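As a quick illustration, here is a minimal Python sketch of these OLS formulas, computed directly from the sample moments. The function names (`sample_moments`, `ols_fit`) and the tiny data set are made up for the example; they are not from any particular package.

```python
from statistics import mean


def sample_moments(x, y):
    """Return (x_bar, y_bar, sxx, syy, sxy), using the (n - 1) divisor."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return x_bar, y_bar, sxx, syy, sxy


def ols_fit(x, y):
    """OLS intercept and slope: b1 = sxy / sxx, b0 = y_bar - b1 * x_bar."""
    x_bar, y_bar, sxx, _, sxy = sample_moments(x, y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```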

Now, recall that the shortest distance between a point and a straight line is obtained if we measure orthogonally (at right angles) to the line. So, let's think about measuring our regression residuals in this way:

In the diagram above, the red line is the fitted regression line; X is just one typical observed data-point; and the line XB is orthogonal to the red line. The ith orthogonal residual is of length $d_i$.

If we're fitting the line using OLS, then the residuals that we use are vertical residuals, such as $e_i$ in the diagram. However, if we're going to follow up on the idea of fitting the regression line so as to minimize the sum of the squared orthogonal residuals, then the first thing that we need to do is to figure out the expression for the length of an orthogonal residual, $d_i$.

We can do this by using a little trigonometry. Of course, most students will deny having ever learned any trigonometry, but they're exaggerating. They just can't remember back to Grade 5 - or whenever. So, you just have to prod them a little - figuratively speaking, of course. Let's look at a second diagram:

In this case, a typical observed data-point is at D. Because BD is orthogonal to the red line, the angle BAD = the angle BDC = θ, say.

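From the triangle BAD, we see that

$$\sin(\theta) = \frac{BD}{AD} = \frac{d_i}{x_i - (y_i - b_0^o) / b_1^o} .$$

From the triangle BCD, we see that

$$\cos(\theta) = \frac{BD}{CD} = \frac{d_i}{b_0^o + b_1^o x_i - y_i} .$$

Because $\cos^2(\theta) + \sin^2(\theta) = 1$, it follows that

$$\frac{d_i^2}{(b_0^o + b_1^o x_i - y_i)^2} + \frac{(b_1^o)^2 d_i^2}{(b_1^o x_i - y_i + b_0^o)^2} = 1 .$$

Multiplying through by the common denominator, we get

$$d_i^2 \, [1 + (b_1^o)^2] = (y_i - b_0^o - b_1^o x_i)^2 ,$$

and, because $d_i$ is a length,

$$d_i = \frac{|y_i - b_0^o - b_1^o x_i|}{[1 + (b_1^o)^2]^{1/2}} .$$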

So, to fit the orthogonal regression we need to find the values of $b_0^o$ and $b_1^o$ that will minimize the function,

$$S = \sum_i \frac{(y_i - b_0^o - b_1^o x_i)^2}{1 + (b_1^o)^2} ,$$

where the summation runs from i = 1 to n.
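To make the geometry concrete, here is a small Python sketch that checks the distance formula against a direct calculation (find the foot of the perpendicular, then measure the Euclidean distance), and then evaluates S for a candidate line. The function names, the test point, and the candidate line are all hypothetical, and the distance is kept signed here because only its square enters S.

```python
import math


def orthogonal_distance(x_i, y_i, b0, b1):
    """Signed orthogonal distance from (x_i, y_i) to the line y = b0 + b1 * x."""
    return (y_i - b0 - b1 * x_i) / math.sqrt(1.0 + b1 ** 2)


def foot_of_perpendicular(x_i, y_i, b0, b1):
    """Point on the line y = b0 + b1 * x that is closest to (x_i, y_i)."""
    x_foot = (x_i + b1 * (y_i - b0)) / (1.0 + b1 ** 2)
    return x_foot, b0 + b1 * x_foot


def sum_sq_orthogonal(b0, b1, x, y):
    """The objective S: sum of squared orthogonal residuals."""
    return sum(orthogonal_distance(xi, yi, b0, b1) ** 2 for xi, yi in zip(x, y))


# Hypothetical line and point, purely for illustration.
b0, b1 = 1.0, 2.0
px, py = 3.0, 4.0

d_formula = orthogonal_distance(px, py, b0, b1)
fx, fy = foot_of_perpendicular(px, py, b0, b1)
d_geometric = math.hypot(px - fx, py - fy)

print(abs(d_formula), d_geometric)  # the two distances should agree

S_value = sum_sq_orthogonal(b0, b1, [px], [py])
```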

Differentiating S partially with respect to $b_0^o$ and $b_1^o$ and setting these derivatives equal to zero, we obtain the solutions:

$$b_1^o = \frac{s_{yy} - s_{xx} + [(s_{xx} - s_{yy})^2 + 4 s_{xy}^2]^{1/2}}{2 s_{xy}} , \qquad (2)$$

$$b_0^o = y^* - b_1^o x^* . \qquad (3)$$

(You can easily check that these values locate a minimum of S, by evaluating the (2 x 2) Hessian matrix.)
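For readers who want the intermediate steps, the differentiation can be sketched as follows. This is one way of organizing the algebra (concentrating out the intercept first), not necessarily the only route; the notation follows the definitions of the sample moments given earlier.

```latex
\begin{aligned}
\frac{\partial S}{\partial b_0^o}
  &= \frac{-2 \sum_i (y_i - b_0^o - b_1^o x_i)}{1 + (b_1^o)^2} = 0
  \;\Rightarrow\; b_0^o = y^* - b_1^o x^* , \\[4pt]
S(b_1^o)
  &= \frac{(n - 1)\,[\, s_{yy} - 2 b_1^o s_{xy} + (b_1^o)^2 s_{xx} \,]}{1 + (b_1^o)^2}
  \quad \text{(after substituting for } b_0^o \text{)} , \\[4pt]
\frac{dS}{db_1^o} = 0
  &\;\Rightarrow\; s_{xy} (b_1^o)^2 - (s_{yy} - s_{xx})\, b_1^o - s_{xy} = 0 , \\[4pt]
b_1^o
  &= \frac{(s_{yy} - s_{xx}) \pm \sqrt{(s_{xx} - s_{yy})^2 + 4 s_{xy}^2}}{2 s_{xy}} .
\end{aligned}
```

The two roots of the quadratic multiply to -1, so they describe two mutually perpendicular lines through the sample means. Taking the "+" sign gives a slope with the same sign as the sample covariance, and that is the root that minimizes S.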

You can see, from (3), that the regression line, fitted using orthogonal least squares, passes through the sample mean of the data (even though the point (x* , y*) is not likely to be in the sample). This is also a property of the OLS regression line, of course.

Also, the (vertical direction) residuals based on the orthogonal regression estimator sum to zero - as long as we have the intercept in the model. Again, this coincides with the situation with OLS. This is something that you can verify very quickly.
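Both properties are easy to confirm numerically. The following is a minimal Python sketch that applies formulae (2) and (3) to a small made-up data set and then checks that the fitted line passes through the sample means and that the vertical residuals sum to zero. The function name `orthogonal_fit` and the data are hypothetical, and the sketch assumes a non-zero sample covariance (otherwise (2) involves division by zero).

```python
import math
from statistics import mean


def orthogonal_fit(x, y):
    """Orthogonal least squares intercept and slope, from formulae (2) and (3).

    Assumes the sample covariance sxy is non-zero.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    b1 = (syy - sxx + math.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

bo0, bo1 = orthogonal_fit(x, y)

# Vertical-direction residuals from the orthogonal fit.
vertical_residuals = [yi - bo0 - bo1 * xi for xi, yi in zip(x, y)]
residual_sum = sum(vertical_residuals)

# Fitted value at the sample mean of x, to compare with the sample mean of y.
fitted_at_mean = bo0 + bo1 * mean(x)

print(f"sum of vertical residuals = {residual_sum:.2e}")
print(f"fitted value at x* = {fitted_at_mean:.4f}, y* = {mean(y):.4f}")
```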

Let's look at a couple of actual examples of orthogonal regression. First, I've generated some artificial data, using (1) with β0 = 1 ; β1 = 2 ; σ = 1; and n = 10,000. Then I've applied both OLS and orthogonal least squares:

Actually, the graph above was created in EViews simply by "grouping" the x and y series; creating a scatter-plot; and then choosing the option to "add" both the orthogonal least squares and ordinary least squares regression lines. In fact the OLS estimates are $b_0$ = 0.9916, and $b_1$ = 2.0107. It's easy to apply formulae (2) and (3) to find the values of $b_0^o$ and $b_1^o$.
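An experiment along these lines can be sketched in a few lines of Python. Note the assumptions: the post doesn't say how the x values were generated, so the uniform draw on (0, 10) below is purely a guess, as are the seed and the variable names. The numbers produced won't match the EViews output above, and the gap between the two slopes will depend on how spread out x is relative to σ.

```python
import math
import random
from statistics import mean

random.seed(2014)  # arbitrary seed, for reproducibility only

n = 10_000
beta0, beta1, sigma = 1.0, 2.0, 1.0

# ASSUMPTION: x ~ Uniform(0, 10); the original x-generating process is not stated.
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
syy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Ordinary least squares.
b1_ols = sxy / sxx
b0_ols = y_bar - b1_ols * x_bar

# Orthogonal least squares, from formulae (2) and (3).
b1_orth = (syy - sxx + math.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
b0_orth = y_bar - b1_orth * x_bar

print(f"OLS:        b0 = {b0_ols:.4f}, b1 = {b1_ols:.4f}")
print(f"Orthogonal: b0 = {b0_orth:.4f}, b1 = {b1_orth:.4f}")
```

With a positive sample covariance, the orthogonal slope always lies between the OLS slope from regressing y on x and the slope implied by regressing x on y (that is, $s_{yy} / s_{xy}$).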

Here's a second example, this one using actual South African household expenditure data made available by Adonis Yatchew (U. of Toronto), at the bottom of his web page. I've fitted a really basic Engel curve for food, of the Working-Leser form:

$$(\mathit{ef}_i / E_i) = \beta_0 + \beta_1 \ln(E_i) + u_i \; ; \quad i = 1, 2, \ldots, n$$

where ef is expenditure on food; E is total expenditure; and n = 7,358. Here are the results:

In this example there's negligible difference between the OLS and orthogonal regression results.

Just out of interest, how do the (sampling) properties of the orthogonal least squares estimators compare with those of the ordinary least squares estimators of β0 and β1? The latter estimators are best linear unbiased (by the Gauss-Markov Theorem), and with normal errors in (1) they are "best unbiased". They're also weakly consistent.

Looking at the formula for $b_1^o$ in (2), we can see right away that this estimator is non-linear. That is, we can't express it as a linear function of the random y data. Accordingly, from (3), $b_0^o$ is not a linear estimator, either. Both estimators are biased in finite samples. However, they can be shown to be weakly consistent in the errors-in-variables setting with equal error variances (e.g., see Kendall and Stuart, 1961).

It can also be shown that the orthogonal regression estimators of β0 and β1 can be given a maximum likelihood interpretation. Specifically, they are MLEs if both x and y are random, and they follow independent normal distributions with the same variance. (See Carroll and Ruppert, 1996.) However, this is a very special case indeed!

In this post, all that I've discussed is point estimation of the simple linear regression model using orthogonal least squares. There's lots to be said about orthogonal least squares for multiple (possibly non-linear) regression models. That's where the "Total Least Squares" estimator arises. There's also lots to be said about interval estimation and inference. Finally, it will come as no surprise to hear that there's a close connection between orthogonal least squares and principal components analysis.

However, these are matters for future posts.
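In the meantime, the principal components connection in the two-variable case can at least be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a made-up data set, finds the dominant eigenvector of the (2 x 2) sample covariance matrix by simple power iteration (chosen just to keep the example free of external libraries), and compares the slope of that direction with formula (2). The function name `dominant_eigenvector` is hypothetical, and the starting vector (1, 1) suits positively correlated data like these.

```python
import math
from statistics import mean


def dominant_eigenvector(a, b, c, iterations=200):
    """Power iteration for the symmetric matrix [[a, b], [b, c]]."""
    v0, v1 = 1.0, 1.0  # starting vector; fine when the covariance is positive
    for _ in range(iterations):
        w0 = a * v0 + b * v1
        w1 = b * v0 + c * v1
        norm = math.hypot(w0, w1)
        v0, v1 = w0 / norm, w1 / norm
    return v0, v1


# Hypothetical data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
syy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Slope of the first principal component direction.
v0, v1 = dominant_eigenvector(sxx, sxy, syy)
pc_slope = v1 / v0

# Slope from formula (2).
bo1 = (syy - sxx + math.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)

print(f"first principal component slope = {pc_slope:.6f}")
print(f"orthogonal regression slope     = {bo1:.6f}")
```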


Carroll, R. J. and D. Ruppert, 1996. The use and mis-use of orthogonal regression in linear errors-in-variables models. American Statistician, 50, 1-6.

Fuller, W. A., 1987. Measurement Error Models. Wiley, New York.

Kendall, M. G. and A. Stuart, 1961. The Advanced Theory of Statistics, Vol. 2. Charles Griffin, London.

Madansky, A., 1959. The fitting of straight lines when both variables are subject to error. Journal of the American Statistical Association, 54, 173-205.

© 2014, David E. Giles


  1. Simply sweet! Great post!

    Could you also suggest a simple real world example where you would consider using orthogonal regression instead of OLS?

    1. Anywhere where both the dependent variable and the regressor may be measured with error.

  2. Very interesting!
    One question: what if I run a principal component factor analysis and then I regress the dependent variable on the predicted scores? Is it the same thing?

  3. Not quite. One connection is that in the simple regression case, where we have 2 variables, X and Y, the fitted orthogonal regression line corresponds to the first principal component.

  4. How to calculate the confidence interval for orthogonal regression?

    1. Quick answer - just bootstrap it. Longer answer - coming in a future post. :-)

    2. Hi Dave! I was wondering if there is a way to calculate prediction intervals around predicted values from orthogonal regression? Thanks

    3. Again - bootstrap them. I'll see if I can do better than this, though.

  5. Is it possible to use orthogonal regression to impute missing values?