Wednesday, July 8, 2015

Parallel Computing for Data Science

Hot off the press, Norman Matloff's book, Parallel Computing for Data Science: With Examples in R, C++ and CUDA (Chapman and Hall/CRC Press, 2015), should appeal to many readers of this blog.

The book's coverage is clear from the following chapter titles:

1. Introduction to Parallel Processing in R
2. Performance Issues: General
3. Principles of Parallel Loop Scheduling
4. The Message Passing Paradigm
5. The Shared Memory Paradigm
6. Parallelism through Accelerator Chips
7. An Inherently Statistical Approach to Parallelization: Subset Methods
8. Distributed Computation
9. Parallel Sorting, Filtering and Prefix Scan
10. Parallel Linear Algebra
Appendix - Review of Matrix Algebra 

The Preface makes it perfectly clear what this book is intended to be, and what it is not intended to be. Consider these passages:

“Unlike almost every other book I’m aware of on parallel computing, you will not find a single example here dealing with solving partial differential equations and other applications of physics...”
That pretty much says it all!

"While the book is chock full of examples, it aims to emphasize general principles. Accordingly, after presenting an introductory code example in Chapter 1 (general principles are meaningless without real examples to tie them to), I devote Chapter 2 not so much as how to write parallel code, as to explaining what the general factors are that can rob a parallel program of speed. Indeed, one can regard the entire book as addressing the plight of the poor guy described at the beginning of Chapter 2: 
Here is an all-too-common scenario:
 An analyst acquires a brand new multicore machine, capable of wondrous things. With great excitement, he codes up his favorite large problem on the new machine—only to find that the parallel version runs more slowly than the serial one. What a disappointment! Let’s see what factors can lead to such a situation... 
One thing this book is not, is a user manual. Though it uses specific tools throughout, such as R’s parallel and Rmpi libraries, OpenMP, CUDA and so on, this is for the sake of concreteness. The book will give the reader a solid introduction to these tools, but is not a compendium of all the different function arguments, environment options and so on. The intent is that the reader, upon completing this book, will be well-poised to learn more about these tools, and most importantly, to write effective parallel code in various other languages, be it Python, Julia or whatever."

From my reading of the book, Matloff achieves his goals, and in doing so he has provided a volume that will be immensely useful to a very wide audience. I can see it being used as a reference by data analysts, statisticians, engineers, econometricians, biometricians, etc. This applies to both established researchers and graduate students. The book provides exactly the sort of information that this audience is looking for, and it is presented in a very accessible and friendly manner.
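
To make the scenario quoted from Chapter 2 concrete, here is a minimal sketch of my own, not an example from the book, and written in Python rather than R. It assumes a deliberately trivial task, so that the cost of handing each item to a worker can outweigh the work itself. The names tiny_task, run_serial, run_parallel and timed are hypothetical. Note also that CPython threads do not run pure-Python code on several cores at once, which is one more reason the "parallel" version here may be no faster. Timings vary by machine, so none are claimed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def tiny_task(x):
    """A deliberately trivial unit of work (hypothetical example)."""
    return x * x


def run_serial(values):
    """Apply tiny_task to every value in a plain loop."""
    return [tiny_task(v) for v in values]


def run_parallel(values, workers=4):
    """Apply tiny_task through a thread pool.

    Every item becomes its own task, so scheduling overhead is paid
    once per item even though the work per item is tiny.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tiny_task, values))


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    values = list(range(20_000))
    serial_result, serial_secs = timed(run_serial, values)
    parallel_result, parallel_secs = timed(run_parallel, values)

    # Both versions must agree; only the running time differs.
    assert serial_result == parallel_result
    print(f"serial:   {serial_secs:.4f} s")
    print(f"parallel: {parallel_secs:.4f} s")
```

Grouping many items into each task is one common remedy for this kind of overhead, and it is the sort of general factor that the Preface says Chapter 2 sets out to explain.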

© 2015, David E. Giles

1 comment:

  1. Found several courses concerning this topic on a course. I think econometrics is an extremely useful instrument for social scientists, especially in our age, when so much data has been collected. I greatly admire Piketty's work on global wealth and distribution. It was only possible because we have such effective instruments for processing huge amounts of data. I recently started working with databases in SQL, and I hope to understand everything easily.