"Let's not kid ourselves: the most widely used piece of
software for statistics is Excel"
(B. D. Ripley, RSS Conference, 2002)
What a sad state of affairs! Sad, but true when you think of all of the number crunching going on in those corporate towers.
With the billions of dollars that are at stake when some of those spreadsheets are being used by the uninitiated, you'd think (and hope) that the calculations are squeaky clean in terms of reliability. Unfortunately, you'd be wrong!
A huge number of reputable studies over the years - ranging from McCullough (1998, 1999), to the special section in Computational Statistics & Data Analysis in 2008 - have pointed out some of the numerical inaccuracies in various releases of some widely used spreadsheets. With reputations at stake, and the potential for litigation, you'd again think (and hope) that by now the purveyors of such software would be on the ball. Not so, it seems!
A recent, and very comprehensive study by Almiron et al. (2010) in the open-access Journal of Statistical Software is a must-read if you're feeling tempted to use spreadsheets for any statistical/econometric analysis that matters. In other words, for any such analysis!
Would you trust a spreadsheet that claims to use a particular, highly regarded algorithm for generating random numbers, when it has been proven that this claim is false? What about a spreadsheet that, when purportedly computing the cumulative distribution function for the Poisson distribution, fails to compute even the first digit correctly? Regrettably, you (and a few million other people) have probably been falling prey to these errors for a very long time!
The study undertaken by Almiron et al. uses a range of different data-sets, and deals with the performances of five spreadsheets (Calc, Excel, Gnumeric, NeoOffice, and Oleo) running on up to three different operating systems:
The study undertaken by Almiron et al. uses a range of different data-sets, and deals with the performances of five spreadsheets (Calc, Excel, Gnumeric, NeoOffice, and Oleo) running on up to three different operating systems:
i386 / Windows
|
i386 / Ubuntu
|
amd64 / Mac OS
| ||
Calc
|
2.4.1
|
X
|
X
| |
3.0.1
|
X
|
X
|
X
| |
Excel
|
2007
|
X
| ||
2008
|
X
| |||
Gnumeric
|
1.8.3
|
X
| ||
1.9.1
|
X
| |||
NeoOffice
|
2.2.5
|
X
| ||
3.0
|
X
| |||
Oleo
|
1.99.16
|
X
|
The authors checked the number of correct significant digits computed by the spreadsheets when calculating the sample mean, standard deviation, and first-order autocorrelation; the F-statistic in one-way ANOVA tests; linear and non-linear regression; and a range of distribution functions. As I've mentioned already, the quality of the spreadsheets' algorithms for computing pseudo-random numbers is also analyzed.
Read the results and weep! And don't presume that you get what you pay for!
Of course, econometricians are much too careful to fall for this! Apparently not. A fairly recent text book by Barreto and Howland (2006), is extremely well motivated - it uses Monte Carlo simulation to teach basic econometrics. I like the concept a lot. Unfortunately, the book's sub-title leaves me a bit concerned - see the reference below.
The bottom line to all of this? Almiron et al. (2010, p.25) :
"Finally, as a rule of the thumb, every user should be aware that spreadsheets have serious limitations. Other platforms are advisable, being currently R the most dependable FLOSS (Free/Libre Open Source Software, see Almiron et al. 2009)."
I couldn't have said it better myself!
Note: The links to the following references may require that your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.
References
Almiron, M. G., E. S. Almeida and M. N. Miranda (2009). The reliability of statistical functions in four software packages freely used in numerical computation. Brazilian Journal of Probability and Statistics, 23, 107-119.
Almiron, M. G., B. Lopes, A. L. Oliviera, A. C. Medeiros and A. C. Frery (2010). On the numerical accuracy of spreadsheets. Journal of Statistical Software, 34 (4), 1-29.
Barreto, H. and F. M. Howland (2006). Introductory Econometrics: Using Monte Carlo Simulation with Microsoft Excel®. Cambridge University Press, Cambridge.
McCullough, B. D. (1998). Assessing the reliability of statistical software: Part I. The American Statistician, 52, 358-366.
McCullough, B. D. (1999). Assessing the reliability of statistical software: Part II. The American Statistician, 53, 149-159.
© 2011, David E. Giles
Reinhart and Rogoff should have been a wake up call, for those who had not already figured this out. Serious calculations deserve serious data structures and serious algorithms, in the worst case if only to confirm the convenient numbers blurted out by the spreadsheets.
ReplyDeleteUnless it was deliberate.
ReplyDeleteOn page 90 of Barreto and Howland (2006), one can read
ReplyDelete“Finally, note that we employ Excel as a teaching tool and do not recommend its use for serious work in econometrics. Excel can usually (not always!) be relied on to provide correct results for simple analyses such as would be found in the typical term paper in introductory econometrics classes. To be assured of accuracy, however, it is best to use a specialized software package”
I take it from this that they agree with the sentiments of this blog. I totally agree