Here are some resources to be used in conjunction with the SDSC Summer Institute hands-on session on Parallel Options for R.
The most recent pre-release slides for this talk (last updated Thursday at 9:04 AM)
- Gordon User Guide: Hadoop - Our official guide to running Hadoop jobs on XSEDE's Gordon resource at SDSC
- Writing Hadoop Jobs in Python with Hadoop Streaming - A more in-depth walkthrough of the word count example for Hadoop streaming. The page uses mappers/reducers written in Python, but the associated GitHub repository contains Python, Perl, and R mappers/reducers.
- Running Hadoop Clusters on Traditional HPC - Gory details on how the Hadoop submit scripts provided by Mahidhar and me work. Although geared towards FutureGrid machines, it is still relevant here because we all use the same myHadoop framework to manage this process.
- Parsing VCF Files with Hadoop Streaming - A real-world example of using Python, one of its bioinformatics libraries, and Hadoop to analyze genetic data
I've put a tarball on Gordon which contains all of the examples I am covering in my talk on Thursday. You can extract them to your home directory by issuing the following commands:
$ cd
$ tar zxvf /home/diag/SI2013-R/parallel_r.tar.gz
You will wind up with four directories containing the sample R scripts (*.R) and the job submission scripts necessary to submit them:
- kmeans/ contains ALL of the kmeans examples (serial, single-, and multi-node)
  - gordon-mc.qsub is the job script to use for the single-node kmeans samples
  - gordon-snow.qsub is the job script to use for the multi-node kmeans samples
- streaming/ contains the Hadoop streaming mapper and reducer scripts, Moby Dick (pg2701.txt), and the submit script you'll need
- rhipe/ contains the RHIPE R script and corresponding job submit script
- rhadoop/ contains the RHadoop R script and corresponding job submit script
You should be able to simply cd into the appropriate directory and do something like qsub wordcount-rhipe.qsub to run the job. No modification should be necessary.
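The tarball ships its own mapper and reducer scripts, so the code below is not what you'll find in streaming/; it's just a minimal sketch of what a Hadoop streaming word count looks like in Python, with both phases in one hypothetical file (the name `wordcount.py` and its `map`/`reduce` arguments are my invention):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit one 'word\t1' record per word seen on stdin."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    """Reduce phase: sum the counts for each word.
    Hadoop delivers the mapper output to the reducer sorted by key,
    so all records for a given word arrive consecutively."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Pick a stage from the command line, read stdin, write stdout.
    stage = mapper if sys.argv[1:] == ["map"] else reducer
    for record in stage(sys.stdin):
        print(record)
```

A handy property of streaming scripts is that you can test them on your laptop with no Hadoop at all, since `sort` stands in for the shuffle phase, e.g. `cat pg2701.txt | python wordcount.py map | sort | python wordcount.py reduce`.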
Fun fact: you can compare some of these files side-by-side to see how similar (or different) RHadoop is from RHIPE:
$ vimdiff rhipe/wordcount-rhipe.R rhadoop/wordcount-rhadoop.R
Here are download links for the scripts and the text of Moby Dick used in the Hadoop examples. You can copy+paste the kmeans examples straight from GitHub into your laptop's R installation to try out the multicore examples.
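For the curious: a common pattern for parallelizing kmeans is to launch several independent random starts concurrently and keep the lowest-inertia fit, which maps naturally onto mclapply-style parallelism. I won't claim that's exactly what the tarball's R scripts do, but here is a minimal Python sketch of that pattern using multiprocessing (the toy dataset and all names are my own, not from the examples):

```python
import random
from multiprocessing import Pool

# Toy 2-D dataset: two well-separated blobs, a stand-in for real input.
_rng = random.Random(0)
DATA = [(_rng.gauss(0, 0.3), _rng.gauss(0, 0.3)) for _ in range(50)] + \
       [(_rng.gauss(5, 0.3), _rng.gauss(5, 0.3)) for _ in range(50)]

def kmeans(seed, k=2, iters=20):
    """One Lloyd's-algorithm run from a seeded random initialization."""
    centers = random.Random(seed).sample(DATA, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in DATA:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Recompute each center as its cluster mean (keep old if empty).
        centers = [(sum(p[0] for p in cl) / len(cl),
                    sum(p[1] for p in cl) / len(cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    inertia = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                      for c in centers) for p in DATA)
    return inertia, centers

def best_of(seeds, workers=4):
    """Run independent starts in parallel; return the lowest-inertia fit."""
    with Pool(workers) as pool:
        results = pool.map(kmeans, list(seeds))
    return min(results, key=lambda r: r[0])
```

The embarrassingly parallel structure is the whole point: each start is independent, so `pool.map` here plays the same role mclapply plays on a single node and a SNOW cluster plays across nodes.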