Submit DAGMan as an HTCondor-C Job
DAGMan can manager grid universe (universe=grid) jobs without any further work. However, if you would like to submit DAGMan itself to run on a remote host, there is a bit more work. The following technique is specific to HTCondor-C (grid_resource=condor), although some notes on other grid types are mentioned.
This discussion assumes you are comfortable using both DAGMan and submitting HTCondor-C jobs.
- Ensure that you can successfully submit a simple HTCondor-C job to the remote site. DAGMan adds additional complexity, and it's best to ensure you have the ability to successfully submit a job before adding that complexity.
- Collect all of the files associated with your DAG in a single submit directory. This include the DAG file, the individual submit files, the executables, and input files.
- Ensure that your arguments and paths are bare file names without path information. "Arguments = /home/adesmet/test" may not work, as /home/adesmet/test may not exist on the remote site. Instead simply use "test" and ensure that test is in the single submit directory.
- Generate a submit file for your DAG using "condor_submit_dag -no_submit DAGfile". Assuming your DAG file is "DAGfile, the generated submit file will be DAGfile.condor.sub. Open the submit file in an editor.
- In the DAG file, change the "universe" from "scheduler" to "grid".
- In the DAG file, before the "queue" line we'll be adding additional information as follows.
- Add the appropriate grid_resource to your DAG file. For example, "grid_resoruce = condor condorc.example.com centralmanager.example.com"
- Add "remote_universe=schedule" to you DAG file.
- Add "WhenToTransferOutput = ON_EXIT" to your DAG file.
- Identify all of your input files. These include: The DAG file itself, all of your individual submit files, and any executables and input files specified by your individual submit files via "input" or "transfer_input_files". these files should all be specified in a new "transfer_input_files" in the DAG file.
- Submit the DAGfile.condor.sub using condor_submit.
Differences with other grid types:
- DAGMan requires a working HTCondor install on the remote system. At the very least a running condor_schedd is required, along with a working condor_submit. With HTCondor-C you can largely assume one is present; after all HTCondor-C is designed to submit to a working HTCondor install. For other systems you will need the system in place and whatever environment and paths are necessary must either be set by your job or the remote grid system.
- HTCondor-C automatically pulls back any output files that are created or modified. Most other grid types do not. You will need to list any output files you want back in transfer_output_files to retrieve them. Some grid systems fail if output files are missing, so you may want to create empty versions and specify them as transfer_input_files to ensure that they are present when the job finishes.
First, I created a test executable called, "executable" and marked it executable with "chmod a+x executable". The contents are:
#! /bin/sh /bin/date /bin/hostname
I ran "./executable" to ensure that it worked.
Next I created a simple submit file to ensure that I could use HTCondor-C.
executable = executable output = eraseme.out error = eraseme.err log = eraseme.log universe= grid grid_resource= condor puffin.cs.wisc.edu condor.cs.wisc.edu remote_universe=scheduler queue
I ran "condor_submit submit-remote", then waited for the job to run and return. I checked eraseme.out to confirm that it successfully ran.
This is a submit file that represents a "real" job. This is just an example, so it just runs my same "executable". The input and transfer_input_files are similar pointless examples and are empty, but are here to illustrate handling additional input files. Also, for the sake of simplicity this job is a "scheduler" universe job. A more realistic example might use "vanilla".
executable = executable output = eraseme.out error = eraseme.err input = test-input transfer_input_files = more-input WhenToTransferOutput = ON_EXIT arguments = -f more-input log = eraseme.log universe= scheduler queue
I submitted this job ("condor_submit submit-local") to confirm that it worked.
Now, a minimal DAG file:
JOB ONLY_JOB submit-local
I created dag.condor.sub by running "condor_submit_dag -no_submit dag". I made several modifications, marked with comments below:
# Filename: dag.condor.sub # Generated by condor_submit_dag dag ### I have changed the universe from scheduler to universe universe=grid ### The next three lines are new, and specify where to send ### my job, and to treat it as a scheduler universe job at ### that remote site. grid_resource= condor puffin.cs.wisc.edu condor.cs.wisc.edu remote_universe=scheduler ### executable = /scratch/adesmet/rust/admin-19650/condor-7.2.4/bin/condor_dagman getenv = True output = dag.lib.out error = dag.lib.err log = dag.dagman.log ### The following two lines were added. Notice that ### all input files have been specified. This includes ### the DAG file ("dag"), the job's submit file (a ### multi-part DAG might have multiple submit files) ### the executable ("executable", again there might be ### more), the input file (test-input, might be more ### than one), and the transfer_input_files (more-input, ### more be more than one). transfer_input_files=dag,submit-local,executable,test-input,more-input WhenToTransferOutput = ON_EXIT ### remove_kill_sig = SIGUSR1 # Note: default on_exit_remove expression: # ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) # attempts to ensure that DAGMan is automatically # requeued by the schedd if it exits abnormally or # is killed (e.g., during a reboot). on_exit_remove = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2)) copy_to_spool = False ### The arguments line has been broken across multiple lines to be more ### readable. You do NOT need to do this. arguments = "-f -l . -Debug 3 -Lockfile dag.lock -AutoRescue 1 \ -DoRescueFrom 0 -Dag dag -CsdVersion $CondorVersion:' '7.2.4' 'Jun' '16'\ '2009' 'BuildID:' '159529' '$" environment = _CONDOR_DAGMAN_LOG=dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0 queue
Finally, I submitted the job as "condor_submit dag.condor.sub".