## Thursday, May 23, 2013

### Actually Computing the Sample Variance!

I always enjoy the posts from John Cook on his blog, The Endeavour. John's a knowledgeable guy, and there's a lot on his blog that's of interest to econometricians. Take a look for yourself!

Back in 2008, John had a post that's relevant to something I've been blogging about recently. It also reminded me of some important issues associated with computation - issues that we used to worry about a great deal in the bad old days of "hand calculations", and computers with short word-lengths and very limited memory.

One thing that needs to be stressed to students is that the algebraic formulae that they learn about are not necessarily expressed in the form that's most appropriate computationally. By "appropriate", I'm referring to both computational accuracy and computational speed. There are actually lots and lots of examples that illustrate the point that I want to make. However, let's just consider the "simple problem" of computing the variance of a sample of data.
Different estimators of the population variance have been the topic of a couple of my recent posts (here and here). I'll be concentrating here on the usual sample variance formula,

s2 = [1 / (n - 1)] Σ[(xi - x*)2] ,                                (1)

where "n" is the sample size, and x* = (1 / n)Σ(xi) is the simple sample mean.

However, the focal point of what I'll be talking about is the term, Σ[(xi - x*)2], so my comments apply in varying degrees to other related measures that we use to estimate the population variance, or to describe the variability of a sample.

So, what's the big deal here? The formula for s2 looks straightforward enough. First, you compute the sample average, x*. Then, you take the difference between each sample observation and x*. You square each of these differences. You add up all of the squared differences; and finally, you divide the total by (n - 1). Nothing to it!
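
Those steps translate directly into code. Here's a minimal Python sketch of this recipe (the function name is mine; this is an illustration, not what any particular package does internally):

```python
def direct_variance(x):
    """Sample variance, computed exactly as described above:
    two passes over the data, deviations taken from the mean."""
    n = len(x)
    mean = sum(x) / n                        # pass 1: the sample mean, x*
    ss = sum((xi - mean) ** 2 for xi in x)   # pass 2: sum of squared deviations
    return ss / (n - 1)                      # divide by (n - 1)

print(direct_variance([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))  # 11.0
```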

What about the number of steps that were required to do this? What about the possibility of "rounding errors" as you go through the various steps I've outlined?

Are these the steps that are actually followed when you are using a statistical package and you use a command such as:

scalar  SAMPVAR = @vars(X)                              (in EViews)

or

SAMPVAR = var(X)                                               (in R or in gretl)?

Before we answer this question, notice that there's more than one way to write the expression for s2. For example, a mathematically equivalent formula is:

s2 = [1 / (n - 1)] [Σ(xi2) - nx*2] .                                 (2)

The formula in (1) is sometimes called "the direct method" for computing s2, and the formula in (2) is sometimes called the "one pass method". In a world of perfect arithmetic, both methods would yield the same numerical result.

The "one pass method" used to be popular some years ago, when we relied on calculators with very limited memory, but no decent statistical package would use it today. The reason is that it's prone to "catastrophic loss of precision" (to borrow a phrase from Braun and Murdoch, 2007, p.11). Those authors show that coding formulae (1) and (2) in R and applying them to the sample {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11} yields the correct answer of 11 in both cases. However, if you add the value 1.e10 to each sample value, formula (2) yields the value 0 instead of 11. (Recall that adding a constant to the data doesn't alter the variance, so 11 is still the correct result.)
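
Braun and Murdoch's experiment is easy to replicate. Here's a Python translation of it (rather than their R code); the exact wrong value that formula (2) returns may differ across languages and platforms, but it certainly isn't 11:

```python
def direct_variance(x):
    """Formula (1): two passes, deviations from the mean."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)

def onepass_variance(x):
    """Formula (2): s2 = [sum of x_i^2 - n*(x*)^2] / (n - 1)."""
    n = len(x)
    mean = sum(x) / n
    return (sum(xi * xi for xi in x) - n * mean * mean) / (n - 1)

shifted = [xi + 1.0e10 for xi in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]
print(direct_variance(shifted))   # 11.0 - the shift does no harm here
print(onepass_variance(shifted))  # catastrophic cancellation: nowhere near 11
```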

Not surprisingly, the var( . ) function in R doesn't use the "one pass method"!

In fact, there are other mathematically equivalent, and useful, expressions for s2 that you may not have encountered. For instance, we can write:

s2 = [1 / (n(n - 1))] [nΣ(xi2) - (Σ(xi))2] .                     (3)

John Cook calls this the "sum of squares method".

Then there's the "corrected two pass algorithm", suggested by Chan et al. (1983), and favoured by Press et al. in their various editions of Numerical Recipes:

s2 = [1 / (n - 1)] {Σ(xi -x*)2 - (1 / n) [Σ(xi - x*)]2} .     (4)

Another algebraically equivalent formula is:

s2 = [1 / (2n(n - 1))] ΣΣ(xi - xj)2 .                             (5)

Now, let's get back to the issue of computational accuracy.

Using a simple simulation experiment, John Cook illustrated that the "sum of squares" formula is hopelessly unreliable. So, we can rule out the formulae in (2) and (3) when it comes to computation. John also found that the "direct method" is very reliable.
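
You don't even need a full simulation to see the problem with formula (3); the shifted sample from the Braun and Murdoch example is enough. A naive Python coding (mine, for illustration only):

```python
def sum_of_squares_variance(x):
    """Formula (3): s2 = [n*sum of x_i^2 - (sum of x_i)^2] / (n(n - 1))."""
    n = len(x)
    return (n * sum(xi * xi for xi in x) - sum(x) ** 2) / (n * (n - 1))

print(sum_of_squares_variance(list(range(1, 12))))  # 11.0 - fine so far
shifted = [xi + 1.0e10 for xi in range(1, 12)]
print(sum_of_squares_variance(shifted))             # nowhere near 11
```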

In fact, it's essentially as reliable as Welford's (1962) method for computing a variance. The latter method involves a simple updating algorithm, and is often recommended for its numerical robustness (e.g., Knuth, 1997). The algorithm is described in another post by John Cook.
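
The updating recursion behind Welford's method is simple enough to state here. A Python sketch (variable names are mine): the trick is that it works with deviations from a running mean, so it never subtracts one huge total from another, nearly equal, one:

```python
def welford_variance(x):
    """Welford (1962): one pass, updating the mean and the running
    sum of squared deviations as each observation arrives."""
    mean = 0.0
    ssd = 0.0                          # running sum of squared deviations
    for k, xk in enumerate(x, start=1):
        delta = xk - mean              # deviation from the old mean
        mean += delta / k              # update the running mean
        ssd += delta * (xk - mean)     # uses old and new deviations
    return ssd / (len(x) - 1)

shifted = [xi + 1.0e10 for xi in range(1, 12)]
print(welford_variance(shifted))       # 11.0, unfazed by the shift
```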

What about the formulae in (4) and (5)? How good are they from a computational perspective?

I'll leave the answer to this question to those of you who'd like to run a little simulation experiment.
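
As a starting point, here are naive Python codings of (4) and (5) (again, sketches of my own, not production code):

```python
def corrected_two_pass(x):
    """Formula (4): the Chan et al. (1983) corrected two-pass algorithm."""
    n = len(x)
    mean = sum(x) / n
    dev = [xi - mean for xi in x]
    # The second term corrects for rounding error in the computed mean;
    # in exact arithmetic it would be identically zero.
    return (sum(d * d for d in dev) - sum(dev) ** 2 / n) / (n - 1)

def pairwise_differences(x):
    """Formula (5): based on all squared pairwise differences (x_i - x_j)^2."""
    n = len(x)
    total = sum((xi - xj) ** 2 for xi in x for xj in x)
    return total / (2 * n * (n - 1))

# Try these on data with and without a large constant added,
# and judge the accuracy for yourself.
shifted = [xi + 1.0e10 for xi in range(1, 12)]
print(corrected_two_pass(shifted))
print(pairwise_differences(shifted))
```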

So, the take-away message is very simple - just because it's convenient to write a formula in a particular way, algebraically, this doesn't necessarily mean that this representation is the best way to code the calculations. You have to consider computational speed, at least to some extent, and you always need to be thinking about computational accuracy.

References

Braun, W. J. and D. J. Murdoch, 2007. A First Course in Statistical Programming With R, Cambridge University Press, Cambridge.

Chan, T. F., G. H. Golub, and R. J. LeVeque, 1983. Algorithms for computing the sample variance: Analysis and recommendations. American Statistician, 37, 242-247.

Knuth, D. E., 1997. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 3rd ed., Addison Wesley, Reading, MA.

Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 1992. Numerical Recipes in FORTRAN 77: The Art of Scientific Computing, 2nd ed., Cambridge University Press, Cambridge.

Welford, B. P., 1962. Note on a method for calculating corrected sums of squares and products. Technometrics, 4, 419-420.