HTCondorWiki: Map Reduce

 
 Condor provides a powerful execution environment for running parallel applications such as MPI.  The Parallel Universe (PU) of Condor is built specifically for this purpose and provides a thin layer of resource management for a wide variety of parallel programming environments. Map-Reduce (MR) is a relatively recent programming model that is particularly suitable for applications that process large data sets on a cluster of computers. A popular open-source implementation of the MR framework is provided by the Hadoop project of the Apache Software Foundation. This document provides instructions for running Hadoop's MapReduce framework under Condor's opportunistic scheduler.
 
-The support we provided and tested for Map-Reduce (MR) involves using Condor resource manager to select a subset of machines within a cluster, setup a Hadoop MR cluster on this subset, submit a MR job and finally clean-up once the job is finished.  The Condor&#8217;s Parallel Universe already provides support for master-slave style applications e.g. MPI.  A user job written to run under PU has the ability to select a set of machines based on certain criteria (specified using job class-ads).  These machines will be available as dedicated resources for the duration of the job.  Additionally, a user can choose which machine should act as a master and communication channels between masters and slave nodes are also established.  In view of these capabilities, Condor has already provided the mechanism to build an on-demand MR cluster.
+The support we provide and have tested for Map-Reduce (MR) uses the Condor resource manager to select a subset of machines within a cluster, set up a Hadoop MR cluster on this subset, submit an MR job, and finally clean up once the job is finished.  Condor's Parallel Universe already supports master-slave style applications, e.g. MPI.  A user job written to run under PU can select a set of machines based on certain criteria (specified using job class-ads).  These machines are available as dedicated resources for the duration of the job.  Additionally, a user can choose which machine should act as the master, and communication channels between the master and slave nodes are also established.  Given these capabilities, Condor already provides the mechanism needed to build an on-demand MR cluster.
 
 This approach has several advantages over running a dedicated MR cluster, e.g. the ability to support several execution environments and to manage resources better among them.  Below we summarize some of the advantages of running an on-demand MR cluster for a specific set of jobs.
 
 1:   Condor has powerful matchmaking capabilities built on its Class-Ad mechanism.  These capabilities can be exploited to implement multiple policies for an MR cluster beyond the current capabilities of existing frameworks.
 
-1: MR style of computation might not be suitable for all sorts of applications or problems (e.g. the ones which are inherently sequential).  A support for multiple execution environments is needed along with different set of policies for each environment.
+1: The MR style of computation might not be suitable for all applications or problems (e.g. those that are inherently sequential).  Support for multiple execution environments is needed, along with a different set of policies for each environment. Condor supports a wide variety of execution environments, including MPI-style jobs, VMware jobs, etc.
 
-1:
+1: Perhaps one of the bigger advantages relates to capacity management in a large shared MR cluster. Currently, the Hadoop MR framework has very limited support for managing users' job priorities.
 
 {subsection: Pre-requisite}
 
@@ -24,15 +24,15 @@
 
 {subsubsection: Getting required files}
 
-We have written a handy python script that takes care of a lot of configuration steps involved in creating a job description file. It generates a specially crafted Condor job submit file for your job. It can then be submitted to Condor scheduler using the same script to get back the tracking URL of the job. This URL is where the Job-Tracker&#8217;s embedded web-server is running. (TODO: I might want to explain how we extract the URL of the job server).  Using the script you can specify: number of slots (CPUs) to be used, MR cluster parameters e.g. capacity of each TaskTracker, job jar file or a script file if you are trying to submit a set of jobs and also Condor job parameters e.g. 'requirement' attribute.
+We have written a handy Python script that takes care of many of the configuration steps involved in creating a job description file. It generates a specially crafted Condor job submit file for your job. The same script can then submit that file to the Condor scheduler and return the tracking URL of the job. This URL is where the Job-Tracker's embedded web server is running. The URL is published as a Job-Ad by mrscript once the Tracker is set up.  Using the script you can specify the number of slots (CPUs) to be used, MR cluster parameters (e.g. the capacity of each TaskTracker), a job jar file or a script file if you are submitting a set of jobs, and Condor job parameters (e.g. the 'requirement' attribute).
 
-This script will soon be part of Condor distribution but for now you can just use the latest version attached with this wiki.  The mrscript.tar.gz contains three files:
+This script will soon be part of the Condor distribution, but for now you can use the latest version attached to this wiki.  The attached file (mrscript.tar.gz) contains three files:
 
 *: mrscript.py - the main script that generates the submit file and is also submitted as part of the user job to set up the cluster.
 
 *: subprocess.py - this is the Python subprocess module, which is part of the standard library from Python 2.4 onward; we use it to manage Java processes while setting up the MR cluster. You only need this file if you are using an older version of Python.
 
-*: hadoop - This is a slightly modified version of &#8216;hadoop&#8217; script under the bin directory of Hadoop software. This can be used to submit a batch of map-reduce job to a single instance of MR cluster setup under PU. See example section below to get some idea of how this can be used.
+*: hadoop - This is a slightly modified version of the 'hadoop' script under the bin directory of the Hadoop software. It can be used to submit a batch of map-reduce jobs to a single instance of an MR cluster set up under PU. See the example section below for an idea of how this can be used.
 
 You will also need a copy of the Hadoop software on the machine from which you submit the job. This is required because all the Jar files needed to run the MR processes are copied to the machines selected for a given job.  As different versions of Hadoop are not compatible with each other, make sure to download the same version as the one used by your Hadoop distributed file system.
 
@@ -55,9 +55,27 @@
 
 Once you have decided on the values of all of the above-mentioned parameters, you are ready to generate the file. Use mrscript.py to generate the job.desc file.
 
-mrscript.py  -m <map> -r <reduce> -j <java> -c <count> -n <URL> -f <user-file1> -f <user-file2>
-The detailed documentation for all the options accepted by mrscript.py is given here.
+{code}
+	./mrscript.py  -m <map> -r <reduce> -j <java> -c <count> -n <URL> -f <user-file1> -f <user-file2> -c <key-value pair> job-jar-file 'args for job-jar-file'
+             (Where) :
+               -m: map capacity of a single Task-Tracker
+               -r: reducer capacity of a single Task-Tracker
+               -j: Location of java home directory on local machine
+               -c: Number of machines to request for running MR cluster
+               -n: URL of hadoop name-node server
+               -f: You can use this option multiple times to specify additional files that should go with your job.
+                   Note that you don't need to specify any of the Hadoop core Jar files.
+               -c: Key-value pair for a Hadoop configuration property; these pairs are placed in the appropriate XML configuration files when setting up the MR cluster.
+                   You can use this option multiple times.
+{endcode}
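
As a rough illustration of how the options above fit together, the following Python sketch assembles an mrscript.py command line from a set of parameters. The helper name `build_mrscript_command` and all example values (paths, host name, jar name, job arguments) are hypothetical and are not part of mrscript.py. Because the usage text above lists `-c` for both the machine count and the key-value pairs, this sketch passes only the machine count and leaves the key-value pairs out.

```python
import shlex


def build_mrscript_command(map_capacity, reduce_capacity, java_home,
                           machine_count, namenode_url, job_jar,
                           job_args="", extra_files=()):
    """Return the argument list for one mrscript.py invocation.

    Only the flags shown in the usage text are used. This helper is a
    hypothetical convenience wrapper, not part of mrscript.py itself.
    """
    cmd = [
        "./mrscript.py",
        "-m", str(map_capacity),     # map capacity of a single Task-Tracker
        "-r", str(reduce_capacity),  # reducer capacity of a single Task-Tracker
        "-j", java_home,             # java home directory on the local machine
        "-c", str(machine_count),    # number of machines for the MR cluster
        "-n", namenode_url,          # URL of the Hadoop name-node server
    ]
    # -f may be repeated, once per additional user file.
    for path in extra_files:
        cmd += ["-f", path]
    # The job jar comes last, followed by its arguments as one string.
    cmd.append(job_jar)
    if job_args:
        cmd.append(job_args)
    return cmd


if __name__ == "__main__":
    # All values below are made-up placeholders.
    command = build_mrscript_command(
        2, 1, "/opt/java", 4, "hdfs://namenode.example.org:9000",
        "my-job.jar", job_args="arg1 arg2", extra_files=["input-list.txt"])
    # Print a shell-quoted form that could be pasted into a terminal.
    print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list rather than a single string keeps the job-jar arguments together as one quoted argument, matching the 'args for job-jar-file' form in the usage line.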
+
+The above command, with the right set of parameters, will generate a 'job.desc' file in the current directory. You can submit this file directly; however, in that case you will have to extract the location of the Job-Tracker web server from the Job-Ads yourself. It is easier to use the submit option of mrscript.py, which periodically polls the Job-Ads, retrieves the tracker's URL, and exits. Here is a simple command for this step.
+
+{code}
+         ./mrscript.py -s
+{endcode}
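
If you submit job.desc yourself, the tracker URL has to be read out of the job's ad by hand. The sketch below shows one way this might be done in Python: it fetches the full job ad with `condor_q -long` and looks for a single attribute in the `Name = value` text. The attribute name `JobTrackerURL` is an assumption made for illustration; this page does not state the name mrscript.py actually publishes, so substitute the real one.

```python
import re
import subprocess


def extract_classad_attribute(classad_text, attribute):
    """Return the value of `attribute` from 'Name = value' ad text, or None.

    Attribute names are matched case-insensitively, and surrounding
    double quotes on string values are stripped.
    """
    pattern = re.compile(
        r'^[ \t]*' + re.escape(attribute) + r'[ \t]*=[ \t]*(.*?)[ \t]*$',
        re.MULTILINE | re.IGNORECASE)
    match = pattern.search(classad_text)
    if match is None:
        return None
    value = match.group(1)
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return value


def fetch_tracker_url(job_id, attribute="JobTrackerURL"):
    """Query the job ad and return the tracker URL, or None if not yet set.

    "JobTrackerURL" is a hypothetical attribute name used for illustration.
    """
    result = subprocess.run(["condor_q", "-long", str(job_id)],
                            capture_output=True, text=True, check=True)
    return extract_classad_attribute(result.stdout, attribute)


if __name__ == "__main__":
    # Made-up ad text standing in for real condor_q output.
    sample = ('ClusterId = 42\n'
              'JobTrackerURL = "http://node1.example.org:50030"\n')
    print(extract_classad_attribute(sample, "JobTrackerURL"))
```

A caller would typically invoke `fetch_tracker_url` in a loop with a short sleep until it returns a value, since the attribute only appears once the Tracker is running.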
+
 
 {subsection: Examples}
 
-Below is an example configuration for running a sample MapReduce job that comes with Hadoop installation. This is a simple sleep job that is part of hadoop-<version string>-examples.jar.
+TODO