{section: Map-Reduce Jobs under Condor}

{subsection: Introduction}
 
Condor provides support for starting Hadoop HDFS services, namely Name- and Datanodes. How the data in HDFS is accessed is up to the user's application, though, including the use of Hadoop's MapReduce framework. However, we provide a submit file generator script for submitting MapReduce jobs into the vanilla universe (1 Jobtracker and n Tasktrackers, where n is specified by the user).
 
Why run a MapReduce job under Condor at all?
 
1: Condor has powerful matchmaking capabilities, using an excellent framework based on the Class-Ad mechanism. These capabilities can be exploited to implement multiple policies for an MR cluster, beyond the current capabilities of existing frameworks.
 
 
1: Perhaps one of the bigger advantages is related to capacity management within a large shared MR cluster. Currently, the Hadoop MR framework has very limited support for managing users' job priorities.
 
{subsection: Prerequisites}
 
You need to have a distributed file system set up, e.g. the Hadoop distributed file system (HDFS). Starting from version 7.5, Condor comes with a storage daemon that provides support for HDFS. More details about our HDFS daemon can be found in the Condor manual (see sections 3.3.23 and 3.13.2). Apart from this, Python version 2.4 or above is required on all the machines that will be part of the MR cluster.
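A rough sketch of the relevant condor_config entries on the name-node machine is shown below; host names, ports, and paths are placeholders, and the exact parameter names should be checked against the manual sections cited above:

{code}
# Sketch only: run Condor's HDFS daemon on this machine (verify parameter names in the manual)
DAEMON_LIST = $(DAEMON_LIST), HDFS
HDFS = $(SBIN)/condor_hdfs

# Local Hadoop installation and the role this node plays in HDFS
HDFS_HOME = /path/to/hadoop-0.21.0
HDFS_NODETYPE = HDFS_NAMENODE
HDFS_NAMENODE_DIR = /scratch/hdfs/name

# Address of the name-node; on data-nodes use HDFS_NODETYPE = HDFS_DATANODE instead
HDFS_NAMENODE = hdfs://my.namenode.edu:54310
{endcode}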
 
{subsection: Submitting a Job}
 
{subsubsection: Getting required files}
 
We have written a handy Python script that takes care of a lot of the configuration steps involved in creating a job description file. It generates a specially crafted Condor job submit file for your job. This file can then be submitted to the Condor scheduler using the same script to get back the tracking URL of the job. This URL is where the Job-Tracker's embedded web-server is running. The information about this URL is published as a Job-Ad by the script once the Tracker is set up. Using the script you can specify: the number of slots (CPUs) to be used, MR cluster parameters (e.g. the capacity of each Tasktracker), the job jar file or a script file if you are trying to submit a set of jobs, and also Condor job parameters (e.g. the 'requirements' attribute).
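Once the job has been submitted (see below), the tracker URL can also be pulled straight out of the Job-Ad. A minimal sketch, assuming a cluster id of 123 and that the attribute name published by the script contains the word "tracker":

{code}
# Dump the job's Class-Ad and look for the tracker URL attribute
condor_q -l 123.0 | grep -i tracker
{endcode}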
 
This script will soon be part of the Condor distribution, but for now you can just use the latest version attached to this wiki page. The attached file (mrscript.tar.gz) contains the following files:
 
 
*: subprocess.py - this is the Python module (not available in version 2.4 or earlier) that we use to manage Java processes while setting up the MR cluster. You only need this if you are using an older version of Python.
 
 
You will also need a copy of the Hadoop software on the machine from which you are submitting the job. This is required, as all the Jar files needed for running the MR processes are copied to the machines selected for a given job. As different versions of Hadoop are not compatible with each other, make sure to download the same version as is used in the Hadoop distributed file system.
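As a quick sanity check you can compare the version of your local copy with the one running in HDFS; the paths and the 0.21.0 release below are placeholders taken from the example at the end of this page:

{code}
# Print the version of the local Hadoop copy (run the same command on the name-node to compare)
/path/to/hadoop-0.21.0/bin/hadoop version

# If they differ, fetch the matching release, e.g. from the Apache archive
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.21.0/hadoop-0.21.0.tar.gz
tar xzf hadoop-0.21.0.tar.gz
{endcode}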
 
1:The URL of the Hadoop name-node server.
1:The Java home directory.
1:The Hadoop installation directory.
1:The map capacity of each Tasktracker.
1:The reduce capacity of each Tasktracker.
1:The number of Tasktrackers that should be used for your job. Only one Jobtracker will be submitted.
1:The jar file for your job.
1:The parameters passed to your job.
1:The number of mappers running per Tasktracker.
1:The number of reducers running per Tasktracker.
1:(Optional) The list of user files that should be sent with your job. These files are other than the ones that the script is configured to send.
 
{subsubsection: Generating job submit file}
               -m: map capacity of a single Tasktracker
               -r: reduce capacity of a single Tasktracker
               -j: Location of the Java home directory on the local machine
               -c: Number of machines used as Tasktrackers
               -n: URL of the Hadoop name-node server
               -f: You can use this option multiple times to specify additional files that should go with your job.
                   Note that you don't need to specify any of the Hadoop core Jar files.
               -c: Key-value pair corresponding to Hadoop XML configuration files that are placed in the appropriate XML files when setting up the MR cluster.
                   You can use this option multiple times.
{endcode}
 
The above command with the right set of parameters will generate a 'job.desc' file in the current directory. You can directly submit this file. Potentially you may want to add certain requirements (=OpSys=, =Arch=, Java version, ...) first.
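For example, you might add (or extend) a requirements line in 'job.desc' along these lines before submitting; the attribute values here are only examples and depend on your pool:

{code}
# Example extra requirements added to job.desc (values are examples)
requirements = (OpSys == "LINUX") && (Arch == "X86_64")
{endcode}

The file is then submitted as usual with =condor_submit job.desc=, and the job can be watched with =condor_q=.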
 
 
{subsection: Examples}
 
Assume you have a jar file "wc.jar" (i.e. Hadoop's wordcount example). Then a possible call for creating your submit file could be:

{code}
./mrscriptVanilla.py -p /path/to/hadoop-0.21.0/ -j /path/to/java/ -m 1 -r 2 -n my.namenode.edu:54310  wc.jar org.apache.hadoop.examples.WordCount hdfs://my.namenode.edu:54310/input hdfs://my.namenode.edu:54310/output
{endcode}
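The input and output paths above live in HDFS. One possible way to stage the input and to inspect the result afterwards (host, port, and paths are the placeholders from the example):

{code}
# Stage some text files into the HDFS input directory used above
/path/to/hadoop-0.21.0/bin/hadoop fs -mkdir hdfs://my.namenode.edu:54310/input
/path/to/hadoop-0.21.0/bin/hadoop fs -put mytext*.txt hdfs://my.namenode.edu:54310/input

# After the job has finished, look at the word counts
/path/to/hadoop-0.21.0/bin/hadoop fs -cat hdfs://my.namenode.edu:54310/output/part-r-*
{endcode}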