Saturday, June 6, 2015

Statistical Calculations & Numerical Accuracy

This post is for those readers who're getting involved with economic statistics for the first time. Basically, it serves as a warning that sometimes the formulae that you learn about have to be treated with care when it comes to the actual numerical implementation.

Sometimes (often) there's more than one way to express the formula for some statistic. While thee formulae may be mathematically identical, they can yield different numerical results when you go to apply them. Yes, this sounds counter-intuitive, but it's true. And it's all to do with the numerical precision that your calculator (computer) is capable of.

The example I'll give is a really simple one. However, the lesson carries over to more interesting situations. For instance, the inversion of matrices that we encounter when applying the OLS estimator is a case in point. When you fit a regression model using several different statistics/econometrics computer packages, you sometimes get slightly different results. This is because the packages can use different numerical methods to implement the algebraic results that you're familiar with.

For me, the difference between a "good" package and a "not so good" package isn't so much the range of fancy techniques that each offers at the press of a few keys. It's more to do with what's going on "under the hood". Do the people writing the code know how make that code (a) numerically accurate; and (b) numerically robust to different data scenarios?

In that respect, packages range from the excellent to the horrible. (For more on the latter, see my old post, "Beware of Econometricians Bearing Spreadsheets".)

Anyway, back to business. Let's look at a really simple calculation - that for the variance of a set of data. This is a subject that I've blogged about before (see here).

In what follows it doesn't matter if we're dealing with a population variance (as below) or a sample variance (with "n - 1" as the divisor).

Let x* denote the (arithmetic) average of the data. That is,

        X* = [Σ(Xi)] / n.

When you first learned the formula for the variance it was probably in this form:

        Var(X) = (1 / n) Σ [(Xi - X*)2]                       (1)

Then, you almost certainly learned that you could write the formula in the following, absolutely equivalent, way:

        Var(X) = (1 / n) [Σ (Xi2) - nX*2]                   (2)

So, really, it shouldn't matter which version of the variance formula you use, should it?


There are two considerations here - speed and accuracy. You might think that speed is simply not an issue, given the fancy computer that you own. However, what if you were computing the variance when n = 1 million, and you were doing this tens of thousands of times (as in a simulation experiment). In that case, speed might be a bit of an issue. For more on this, and some additional variance formulae, see my previous post and its links to John Cook's excellent blog.

What about numerical accuracy? In this respect there can be an important difference between the results obtained using formulae (1) and (2).

A little bit of R code will illustrate this point. The variable, "var1" is the variance computed according to the formula in (1), and "var2" is the variance computed using equation (2):

We see that the variance of the data is computed to be 1.061049, whichever of the two formulae we use.

Now, let's add a large constant value, namely 1 million, to each of the data values. Adding any constant value doesn't change the variance - it just shifts the mean, right? However, let's see what happens when we apply the two alternative formulae for the variance:

The first formula still gives the same (correct) answer for the variance of the data as before. The value given by the second formula is similar, but not identical - which it should be!

If we increase the value of the constant that we're adding to each data value from 106 to 108, the results really diverge:

If you think that's bad, then look what happens if we increase the value of the constant to 1010 :

A negative variance? Oops!

What's going on here? In a nutshell, the terms that are being squared in (1) are of the order zero in value. On the other hand, in (2) the term I've called "x2" is of order 1020, and so is the square of "xbar". Once we start taking differences of numbers like this, there's an awful lot of potential for rounding error.

Which formula will you be using in the future?

© 2015, David E. Giles


  1. Actually, I won't be using either of those Compare them to var(x):

    > set.seed(100)
    > x<-rnorm(n)
    > set.seed(100)
    > y<-rnorm(n)+10^10
    > var(x)
    [1] 1.062112
    > var(y)
    [1] 1.062111
    > var(x)-var(y)
    [1] 2.528025e-08

    When you add 10^10 to the numbers the bad formula goes negative, the good formula changes by 10^-3, but var(x) changes by 10^-8

    var() is like your var1, but with a two-pass algorithm to compute the mean. For future improvements, Radford Neal has a new floating point summation algorithm that's only 2-4 times slower and is exact in the sense that it returns the closest floating point number to the exact sum.

    Of course, this basically supports your point that you want the people writing the package to know how to get accuracy and to care.