{section: Map-Reduce Jobs under Condor}

{subsection: NOTE: If you want to try MapReduce with Condor, download the file "mrscriptVanilla.tar.gz"}

{subsection: Introduction}

Condor provides support for starting Hadoop HDFS services, namely Name- and Datanodes. HDFS data access is left up to the user's application, though, including the usage of Hadoop's MapReduce framework. However, we provide a submit file generator script for submitting MapReduce jobs into the vanilla universe (1 Jobtracker and n Tasktrackers, where n is specified by the user).

{subsubsection: Getting required files}

We have written a handy python script that takes care of many of the configuration steps involved in creating a job description file. It generates a specially crafted Condor submit file for your job, which can then be submitted to the Condor scheduler using the same script to get back the tracking URL of the job. This URL is where the Job-Tracker's embedded web-server is running. Information about this URL is published in the Job-Ad by mrscriptVanilla once the Tracker is set up. Using the script you can specify: the number of slots (CPUs) to be used, MR cluster parameters (e.g. the capacity of each Tasktracker), the job jar file (or a script file if you are trying to submit a set of jobs), and Condor job parameters (e.g. the 'requirement' attribute).

This script will soon be part of the Condor distribution, but for now you can just use the latest version attached to this wiki. The attached file (mrscriptVanilla.tar.gz) contains two files:

*: mrscriptVanilla.py - the main script that generates the submit file; it is also submitted as part of the user job to set up the cluster.
*: subprocess.py - the python module (not part of python before version 2.4) that we use to manage java processes while setting up the MR cluster. You only need this if you are using an older version of python.
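Since the tracking URL is published in the job's ClassAd, you can also read it back with the standard Condor tools once the Tracker is running. A minimal sketch, assuming the job landed in cluster 42; the exact attribute name that mrscriptVanilla publishes is not spelled out here, so the grep pattern is only a guess to narrow the output:

{code}
# Sketch: dump the job's ClassAd and look for the published tracker URL.
# "42" is a placeholder cluster id; adjust the pattern to whatever
# attribute name mrscriptVanilla actually publishes.
condor_q -long 42 | grep -i url
{endcode}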
{subsubsection: Configuration Parameters}

mrscriptVanilla.py requires you to specify certain configuration parameters, e.g. the number of CPU slots to request, the location of the Hadoop installation directory on your submit machine, etc. Below is a list of configuration variables whose values you should decide on before running the script to generate the submit file.

1: The URL of the Hadoop name-node server.
1: The java home directory.

Once you have decided upon the values of all of the above-mentioned parameters, you are ready to generate the file. Use mrscriptVanilla.py to generate the job.desc file.

{code}
./mrscriptVanilla.py -m <map> -r <reduce> -j <java> -c <count> -n <URL> -f <user-file1> -f <user-file2> -c <key-value pair> job-jar-file 'args for job-jar-file'
{endcode}

Where:

*: -m: map capacity of a single Task-Tracker
*: -r: reducer capacity of a single Task-Tracker
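To make the flags above concrete, here is a hypothetical invocation that runs Hadoop's stock wordcount example. Every value (slot count, java home, name-node URL, jar name, and HDFS paths) is a placeholder chosen for illustration, not a default of the script:

{code}
# Hypothetical example: request 4 slots, with each Task-Tracker running
# up to 2 map tasks and 1 reduce task. All paths and the URL are
# placeholder values for your own setup.
./mrscriptVanilla.py -m 2 -r 1 -j /usr/lib/jvm/java-6-sun -c 4 \
    -n hdfs://namenode.example.com:9000 \
    hadoop-examples.jar 'wordcount /user/me/input /user/me/output'
{endcode}

This writes out the job.desc file; as described above, the same script is then used to submit it and report the tracking URL. Since job.desc is an ordinary Condor submit file, it should also be accepted by condor_submit directly.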