Parallel processing in R

One of the limitations of the R statistical computing language has been its lack of support for parallelization. This is particularly aggravating if your work, like mine, involves a lot of randomized simulation and sub-sampling. These randomized processes are in general easily parallelizable.

The lack of parallelization in R has recently been partially addressed by the release of the multicore package, by Simon Urbanek. This package provides access in R to the fork() system call and related functions, allowing for the creation of and communication with R sub-processes.
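To make this concrete, here is a minimal sketch of the multicore interface (assuming the package is installed; `parallel()`, `collect()`, and `mclapply()` are its core functions):

```r
library(multicore)

# parallel() forks a child R process (via fork()) to evaluate the
# expression; collect() waits for the children and gathers their results.
jobs <- lapply(1:4, function(i) parallel(i^2))
results <- collect(jobs)

# mclapply() wraps this pattern up as a parallel drop-in for lapply().
squares <- mclapply(1:4, function(i) i^2)
```

Note that because this relies on fork(), it works on Unix-like systems but not on Windows.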

The interface of the multicore package is quite low-level and not especially user-friendly, but a few high-level wrappers exist. I have been investigating one of them, the foreach package, released by Revolution Computing, a company providing commercial support and services for R. This package adds a straightforward foreach construct to R, which supports parallel processing by plugging into parallel backends such as multicore.
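A small sketch of the foreach construct (assuming the foreach and doMC packages are installed; doMC is the multicore-based backend from the same authors):

```r
library(foreach)
library(doMC)

# Register the multicore-based backend; %dopar% dispatches to it.
registerDoMC(cores = 4)

# %dopar% runs the iterations in parallel (%do% would run them
# sequentially). .combine collapses the per-iteration results;
# the default is a list.
result <- foreach(i = 1:8, .combine = c) %dopar% {
  sqrt(i)
}
```

If no backend has been registered, %dopar% issues a warning and falls back to sequential execution, so the same code runs anywhere.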

Parallelizing one of my subsampling routines was a straightforward matter of changing lapply(seq, function(j) (doSomething(j))) to foreach(j=seq) %dopar% (doSomething(j)). Using five cores I reduced the elapsed time to around 35% of the serial run; I suspect a further reduction would be achievable, but the subprocesses were not identical in workload (some were subsampling from larger datasets than others).
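The before-and-after looks roughly like this (doSomething and seq are hypothetical stand-ins for the actual subsampling routine and index sequence):

```r
library(foreach)

# Hypothetical stand-ins for illustration only.
doSomething <- function(j) mean(rnorm(1000, mean = j))
seq <- 1:5

# Serial version:
res.serial <- lapply(seq, function(j) (doSomething(j)))

# Parallel version; with a backend registered (e.g. via doMC's
# registerDoMC()), the iterations run in separate processes.
res.parallel <- foreach(j = seq) %dopar% (doSomething(j))
```

Both versions return a list of the same length, so downstream code needs no changes.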
