Here are some resources mentioned during the "Parallel Computing in R" talk I'm giving at the co-hosted meeting of the San Diego R User Group and the PACE Tech Talk Series on September 10, 2013.
Alternatively, you can send me an email (glock at sdsc dot edu) and I'd be happy to send you the slides myself.
Here are download links for the scripts and the text of Moby Dick used in the Hadoop examples. You can copy+paste the kmeans examples straight from GitHub into your laptop's R installation to try out the multicore examples.
- GitHub repository containing all parallel R sample scripts (and batch submit scripts for Gordon)
- Moby Dick by Herman Melville for word count examples
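For anyone who wants a feel for the word count examples before downloading anything, here is a minimal sketch of the map/sort/reduce pattern in plain Python. It is not one of the scripts from the repository: the function names (mapper, reducer, word_count) and the tokenizing rule are assumptions made for illustration, and it runs locally with a call to sorted() standing in for Hadoop's shuffle step.

```python
import re
from itertools import groupby


def mapper(lines):
    """Yield a (word, 1) pair for every word, much as a streaming mapper prints 'word<TAB>1'."""
    for line in lines:
        # Assumed tokenizing rule: lowercase runs of letters and apostrophes
        for word in re.findall(r"[a-z']+", line.lower()):
            yield word, 1


def reducer(sorted_pairs):
    """Sum the counts for each run of identical keys; the input must be sorted by key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() stands in for the shuffle/sort Hadoop performs between map and reduce
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["Call me Ishmael.", "Call me, call me."]
    print(word_count(sample))
```

With real Hadoop streaming the mapper and reducer would be separate scripts reading standard input and writing tab-separated lines to standard output, but the logic in each stage is essentially the same.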
SDSC and XSEDE
If you are interested in getting compute time on Gordon, you can quickly get a startup allocation through the National Science Foundation's XSEDE program for up to 100,000 core-hours by writing up an abstract of the work you would like to do. For more information, either check out the XSEDE allocations webpage or send us an email at firstname.lastname@example.org. Gordon (and this presentation) is funded under NSF award OCI-0910847.
If you are interested in attending PACE's Data Mining Boot Camp, you can register via the PACE Boot Camps webpage. The first Boot Camp will be held September 12 and 13, and members of the San Diego R User Group get a 10% discount on the $1295 registration fee (contact me for the promo code). People with academic affiliations (a .edu email address) are eligible for a $600 discount.
As a personal testimonial (note: I am not funded by PACE), these workshops are a really great way to get both conceptual and hands-on training in a lot of the principles that are key to predictive analytics. The sessions are led by very high-caliber individuals and, if nothing else, are an exceptional way to connect with local experts in the field of predictive analytics. PACE also hosts monthly Tech Talks, which are free to the public and may be of interest to members of the San Diego R User Group. This talk of mine is being cross-promoted as a Tech Talk, and I'm hoping that hosting the September meeting at SDSC opens the door to future collaboration between PACE and the R User Group.
Here are a few guides we have created at SDSC that are relevant to this particular talk:
- Parallel R
- Gordon User Guide: Hadoop - Our official guide to running Hadoop jobs on XSEDE's Gordon resource at SDSC
- Conceptual Overview of Map/Reduce and Hadoop - The essentials of map/reduce and what sorts of problems it can solve efficiently
- Writing Hadoop Jobs in Python with Hadoop Streaming - A more in-depth walkthrough of the word count example for Hadoop streaming. The page uses mappers/reducers written in Python, but the associated GitHub repository contains Python, Perl, and R mappers/reducers.
- Running Hadoop Clusters on Traditional HPC - Gory details on how the Hadoop submit scripts provided by SDSC work.
- Parsing VCF Files with Hadoop Streaming - A real-world example of using Python, one of its bioinformatics libraries, and Hadoop to analyze genetic data
- Parallel R
- Jonathan Olmsted's Site - Full of great tutorials for running parallel R
- Ryan Hafen's Rhipe/Hadoop Tutorial - A really nice set of tutorials on Hadoop and using Rhipe
- Rhipe 0.65.2 documentation - The full set of options for Rhipe's map/reduce interface, and what knobs you can and cannot turn
- How to run R programs on tara - U of Maryland, Baltimore County's website on running parallel R on supercomputers. Includes helpful sample code and submit scripts to use with SLURM
- R module snow - U of Michigan's website on running snow on their clusters. Includes submit scripts that work with PBS/Torque
- snow Simplified - Nice snow-oriented tutorial with plenty of sample code