Yesterday I had a short post reminding EViews users that their package (versions 7 or 8) will access all of the cores on a multi-core machine. I've been playing around with parallel processing in R on my desktop machine at work over the last few days. It's something I've been meaning to do for a while, and it proved to be well worth the time.
Before I share my results with you, let me make a couple of comments.
First, parallel processing involves some costs in terms of communication overheads, so not all tasks are well-suited to this type of processing. It's easy to generate examples that are computationally intensive, but execute faster on a single processor than on a cluster (of cores, or machines).
Second, even when a task is suitable for parallel processing, don't expect the reduction in elapsed time to be linearly related to the increase in the number of cores. Remember, there are overheads involved!
Recently, there have been some posts out there that have illustrated the advantages of parallel processing in R. For example, WenSui Liu posted a piece describing some experiments run using the Ubuntu O/S. Also, Daniel Marcelino had a post that compared various "parallel" packages in R on a MacBook Pro. Nice choice of machine - it's running UNIX beneath that pretty cover! And then, just as I was writing this post today, Arthur Charpentier came out with this related post, also based on results using a Mac.
However, none of these posts deal with a Windows environment, or the sorts of Monte Carlo or bootstrap simulations that econometricians use all of the time. So, I felt that there was something more to explore.
The first thing that I discovered, after a lot of digging around, is that although there's a number of R packages to help with parallel processing, if you're running Windows then your options are limited. O.K., that's no surprise, of course! Don't write comments saying that I should be using a different O/S if I want to engage in fast computing. I know that!
However, let's stick with Windows. In that case it seems that the snowfall package for R is the best choice, currently. That's what the results below are based on.
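For anyone who hasn't used snowfall before, here's a minimal sketch of the basic workflow that the results below rely on. The choice of 8 CPUs and the toy sfLapply() call are purely illustrative - adjust them to your own machine and task:

library(snowfall)

sfInit(parallel = TRUE, cpus = 8, type = "SOCK")  # socket cluster - works fine on Windows
sfClusterSetupRNG()                               # independent RNG streams (needs the rlecuyer package)
res <- sfLapply(1:1000, function(i) sqrt(i))      # farm the iterations out to the workers
sfStop()                                          # always release the workers when done

The pattern is always the same: sfInit() grabs the workers, sfLapply() (or one of its relatives) spreads the work across them, and sfStop() shuts them down.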
Well, here are a couple of small examples, run on my Dell desktop. It has an Intel i7-3770 processor (4 cores, plus hyperthreading) and 12GB of RAM. I'm running Windows 7 (64-bit).
Test 1:
This test involves bootstrapping the sampling distribution of an OLS estimator. Of course, we know the answer - this is just an illustration of processing times!
There are 9,999 replications. The R script is on the code page for this blog, and it's a slightly modified version of an example given by Knaus et al. (2009).
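Just to fix ideas, here's a minimal sketch of how a bootstrap like this can be farmed out with snowfall. It isn't the script from the code page - the data-generating process, the sample size, and the helper name boot.ols are all illustrative assumptions:

library(snowfall)

set.seed(123)
n   <- 100
x   <- rnorm(n)
dat <- data.frame(x = x, y = 1 + 2 * x + rnorm(n))   # artificial data for illustration

boot.ols <- function(i) {
  d <- dat[sample(nrow(dat), replace = TRUE), ]      # resample the rows with replacement
  coef(lm(y ~ x, data = d))[2]                       # keep the bootstrap slope estimate
}

sfInit(parallel = TRUE, cpus = 8)
sfExport("dat")                                      # ship the data out to the workers
b <- unlist(sfLapply(1:9999, boot.ols))              # 9,999 bootstrap replications
sfStop()

sd(b)                                                # bootstrap standard error of the slope

The only real change from a serial version is that lapply() becomes sfLapply(), plus the sfInit()/sfExport()/sfStop() housekeeping.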
Test 2:
This test involves a Monte Carlo simulation of the power of a paired t-test, using 1,999 replications, and sample sizes of n = 10, 15, ..., 200 (that is, 10 to 200 in steps of 5). Again, the R script is on the code page for this blog, and it's a modified version of an example given by Spector (undated).
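Again, purely for illustration, here's a minimal sketch of this sort of power simulation, parallelised over the sample sizes with snowfall. The effect size (0.5), the 5% significance level, and the function name power.n are my own assumptions, not taken from Spector's script:

library(snowfall)

power.n <- function(n, nrep = 1999, delta = 0.5) {
  rejections <- replicate(nrep, {
    d <- rnorm(n, mean = delta, sd = 1)   # paired differences under the alternative
    t.test(d)$p.value < 0.05              # one-sample t-test on the differences
  })
  mean(rejections)                        # estimated power at this sample size
}

sfInit(parallel = TRUE, cpus = 8)
sizes <- seq(10, 200, by = 5)             # n = 10, 15, ..., 200
pow   <- unlist(sfLapply(sizes, power.n))
sfStop()

# plot(sizes, pow, type = "l") then traces out the estimated power curve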
The results when we allow R to access different numbers of cores are:
BTW - this is what I really enjoyed seeing - all of the cores on my machine running at full steam!
Of course, these processing times could be improved a lot by moving to an environment other than Windows! The point of the exercise, though, is simply to show you the effect of grabbing more cores when running simulations of the type that we use a lot in econometrics.
References
Knaus, J., C. Porzelius, H. Binder, & G. Schwarzer, 2009. Easier parallel computing in R with snowfall and sfCluster. The R Journal, 1/1, 54-59.
Spector, P., undated. Using the snowfall library in R. Mimeo., Statistical Computing Facility, Department of Statistics, University of California, Berkeley.
© 2013, David E. Giles
I use parallel processing in MATLAB for some applications (e.g., parfor loops to perform multiple GARCH estimations). However, the statistics that usually are slowest for me (i.e., MCMC/Gibbs sampling) don't seem to benefit much from parallel processing.
John - not sure about MATLAB. I'm not a user.
MCMC, regular Monte Carlo, and bootstrapping are always touted as classic examples of where parallel processing is advantageous. The examples I gave certainly illustrate this point. Indeed, I recall going to sessions at JSM in the early '80s where people were using lots of PCs to run MCMC. I always understood that this is what got MCMC off the ground, so to speak.
My son, a comp. sci. grad., confirmed this from his experience with this type of simulation, using several thousand cores on WestGrid!
I'm not an expert on the technique, but let me spell out my issue. Suppose initially you do a Gibbs sampling estimation of a multivariate normal distribution with an inverse Wishart prior. You burn x,000 iterations and draw y,000 from the posterior predictive. There are probably more advanced parallel techniques, but suppose you split the problem across 5 processors. That way you could split the y,000 posterior draws into y,000/5 on each of the five processors. However, you still need to do the x,000 burn-in. So either you do the x,000 burn-in on each of the five processors, or you do it once and just assume it's fine to use the same burn-in on each of the five processors (you still have the bottleneck of the burn-in, though).
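Just to illustrate the pattern being described - independent chains run in parallel, with each worker repeating its own burn-in - here is a toy sketch using snowfall and a simple bivariate-normal Gibbs sampler. The model, the chain lengths, and the name run.chain are illustrative only; this is not the inverse-Wishart problem above:

library(snowfall)

run.chain <- function(chain.id, n.burn = 5000, n.keep = 5000, rho = 0.8) {
  set.seed(1000 + chain.id)                    # crude per-chain seeding, for illustration
  x <- 0; y <- 0
  draws <- matrix(NA_real_, n.keep, 2)
  for (i in 1:(n.burn + n.keep)) {
    x <- rnorm(1, rho * y, sqrt(1 - rho^2))    # draw x | y
    y <- rnorm(1, rho * x, sqrt(1 - rho^2))    # draw y | x
    if (i > n.burn) draws[i - n.burn, ] <- c(x, y)
  }
  draws
}

sfInit(parallel = TRUE, cpus = 5)
chains <- sfLapply(1:5, run.chain)             # 5 chains - each one pays for its own burn-in
sfStop()

The y,000/5 kept draws per worker come cheaply; the replicated x,000 burn-in is exactly the bottleneck described above.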
While Gibbs sampling for large problems is slow, it's really not my biggest issue with it. My problem is that if I estimate a Bayesian VAR, then the mean of the coefficients at each iteration often has all eigenvalues with modulus less than 1 (within the unit circle/stationary), but when you sample from the distribution it is very rare (and rarer the larger the VAR) that the coefficients will be within the unit circle.
John - re. burn-in constraints - have you looked at either of these:
http://dcr.r-forge.r-project.org/tutorials/dcpar3.pdf
http://www.stats.ox.ac.uk/__data/assets/pdf_file/0003/8418/parproj.pdf
Don't know about second problem.
DG
Your CPU actually only has 4 cores and hyperthreading. That is why there is so much less gain in performance with >4 threads in your benchmarks.
Aha!!! Thanks!
I fell into the same trap a few years ago.
http://stackoverflow.com/questions/3547831/different-behavior-when-using-different-number-of-multicoring-workers