The way this is handled in the code is not as clean as it could be, because PRE and POST scripts were added to DAGMan after the initial code was written. As a result, PRE scripts are handled outside of the ready queue data structure, which is awkward.
There are a number of important data structures relating to job ordering:
Dag::_jobs
: a list of all Job objects (nodes) in the DAG.

Dag::_readyQ
: a queue of all jobs that are ready to be submitted (all of their parent nodes have finished, and their PRE scripts, if any, have also finished).

Job::_queues[Q_WAITING]
: a list of Job objects (nodes) that this Job is waiting on (Jobs are removed from this queue as they finish). (The Job object also has parents and children queues.)
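Here is a minimal sketch of how these structures fit together (illustrative only; the real Job and Dag classes in the HTCondor source have many more members, and the container types here are stand-ins):

    #include <list>
    #include <queue>
    #include <string>
    #include <vector>

    // Illustrative stand-in for DAGMan's Job class (one node in the DAG).
    struct Job {
        std::string name;
        std::vector<Job*> parents;    // fixed for the life of the run
        std::vector<Job*> children;   // fixed for the life of the run
        std::list<Job*>   waiting;    // unfinished parents; shrinks as they finish
        bool hasPreScript  = false;
        bool hasPostScript = false;
    };

    // Illustrative stand-in for the Dag class's ordering structures.
    struct Dag {
        std::list<Job*>  jobs;    // all nodes in the DAG (Dag::_jobs)
        std::queue<Job*> readyQ;  // nodes ready to submit (Dag::_readyQ)
    };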
When DAGMan starts up, the ready queue is empty. Dag::Bootstrap() calls Dag::StartNode() on all jobs with empty waiting queues. Dag::StartNode() either runs the node's PRE script (if there is one) or puts the node into the ready queue. If a job does have a PRE script, and the PRE script succeeds, Dag::PreScriptReaper() puts the job into the ready queue. (If the PRE script fails, Dag::PreScriptReaper() marks the job as failed, except in special cases.)
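In rough pseudocode, the bootstrap logic looks like this (a sketch built on the structs above, not the real implementation; RunPreScript() is a hypothetical helper standing in for DAGMan's script machinery):

    void RunPreScript(Job* job);  // hypothetical: forks the node's PRE script

    // Sketch: a node with a PRE script runs that first; otherwise it is
    // ready to submit immediately.
    void StartNode(Dag& dag, Job* job) {
        if (job->hasPreScript) {
            RunPreScript(job);     // PreScriptReaper() enqueues on success
        } else {
            dag.readyQ.push(job);
        }
    }

    // Sketch: at startup, start every node whose waiting queue is empty.
    void Bootstrap(Dag& dag) {
        for (Job* job : dag.jobs) {
            if (job->waiting.empty()) {
                StartNode(dag, job);
            }
        }
    }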
Once a job is in the ready queue, it will eventually get submitted by Dag::SubmitReadyJobs(), which is called from condor_event_timer() in dagman_main.cpp; condor_event_timer() is called by daemoncore (every 5 seconds by default). Note that Dag::SubmitReadyJobs() will only submit a certain number of jobs each time it is called (that number is configurable). If the attempt to submit the job fails, Dag::SubmitReadyJobs() calls Dag::ProcessFailedSubmit(), which puts the job back into the ready queue.
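Sketched out (the per-cycle cap corresponds to the configurable limit just mentioned; TrySubmit() is a hypothetical wrapper around the actual submit call):

    bool TrySubmit(Job* job);  // hypothetical: attempts the actual submission

    // Sketch: a failed submit just goes back on the ready queue, to be
    // retried on a later timer cycle.
    void ProcessFailedSubmit(Dag& dag, Job* job) {
        dag.readyQ.push(job);
    }

    // Sketch: submit at most maxPerCycle jobs per timer tick.
    void SubmitReadyJobs(Dag& dag, int maxPerCycle) {
        int submitted = 0;
        while (!dag.readyQ.empty() && submitted < maxPerCycle) {
            Job* job = dag.readyQ.front();
            dag.readyQ.pop();
            if (TrySubmit(job)) {
                ++submitted;
            } else {
                ProcessFailedSubmit(dag, job);
            }
        }
    }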
When a job finishes, DAGMan sees the job's terminated event in the appropriate log file and calls Dag::ProcessTerminatedEvent(). If the job failed, Dag::ProcessTerminatedEvent() calls Job::TerminateFailure(), which marks the job as failed. Dag::ProcessTerminatedEvent() then calls Dag::ProcessJobProcEnd(), whether the job succeeded or failed. Dag::ProcessJobProcEnd() takes a number of possible actions, such as initiating a retry for the node, starting the node's POST script, waiting for other job procs to finish if the cluster contains more than one proc, or marking the node as successful.
If the node has a POST script, Dag::ProcessJobProcEnd() starts it (the POST script runs whether the node's job succeeded or failed). When the POST script exits, Dag::PostScriptReaper() is called, and the POST script's exit status, rather than the job's, determines whether the node is marked as successful or failed.
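A rough sketch of the termination path (again building on the structs above; the real code dispatches on user log event types and handles retries and multi-proc clusters, which are elided here; RunPostScript() and MaybeRetryNode() are hypothetical stand-ins):

    void RunPostScript(Job* job);             // hypothetical: forks the POST script
    void MaybeRetryNode(Dag& dag, Job* job);  // hypothetical: retry handling
    void TerminateJob(Dag& dag, Job* job);    // defined in the next sketch
    void TerminateFailure(Job* job);          // marks the node as failed

    // Sketch: decide what happens next for a node whose job proc ended.
    void ProcessJobProcEnd(Dag& dag, Job* job, bool jobSucceeded) {
        if (job->hasPostScript) {
            RunPostScript(job);      // PostScriptReaper() settles the outcome
        } else if (jobSucceeded) {
            TerminateJob(dag, job);  // node is done; release its children
        } else {
            MaybeRetryNode(dag, job);
        }
    }

    // Sketch: handle a terminated event for a job.
    void ProcessTerminatedEvent(Dag& dag, Job* job, bool jobSucceeded) {
        if (!jobSucceeded) {
            TerminateFailure(job);   // mark the node as failed
        }
        ProcessJobProcEnd(dag, job, jobSucceeded);  // runs either way
    }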
When a node finishes, we call Dag::TerminateJob() on it; that method goes through the list of this node's children and removes the just-finished node from the children's waiting queues. For each child whose waiting queue becomes empty, it calls Dag::StartNode(), and the cycle continues.
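Sketched, the dependency-release step is simple (illustrative; StartNode() is the sketch from earlier):

    void StartNode(Dag& dag, Job* job);  // sketched earlier

    // Sketch: a finished node is removed from each child's waiting queue;
    // any child left with no unfinished parents gets started.
    void TerminateJob(Dag& dag, Job* finished) {
        for (Job* child : finished->children) {
            child->waiting.remove(finished);  // std::list::remove
            if (child->waiting.empty()) {
                StartNode(dag, child);
            }
        }
    }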
condor_event_timer() in dagman_main.cpp gets called every five seconds (by default). In that function, we call Dag::SubmitReadyJobs() to submit any jobs that are ready; read any new node job events (see ???); output the status of the DAG; and check whether the DAG is finished.
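In sketch form, each timer tick does roughly the following (the helper names here are hypothetical; daemoncore actually invokes condor_event_timer() as a registered periodic timer callback):

    void ReadNewEvents(Dag& dag);         // hypothetical: process new node job events
    void PrintDagStatus(const Dag& dag);  // hypothetical: status output
    bool DagIsFinished(const Dag& dag);   // hypothetical: all-nodes-done check

    // Sketch: the work done on each (default five-second) timer tick.
    void EventTimerTick(Dag& dag, int maxPerCycle) {
        SubmitReadyJobs(dag, maxPerCycle);  // submit whatever is ready
        ReadNewEvents(dag);                 // read any new node job events
        PrintDagStatus(dag);                // output the status of the DAG
        if (DagIsFinished(dag)) {
            // wrap up and exit (details elided)
        }
    }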
- Submission/event reading loop -- every 5 sec (default) via daemoncore
- ready queue -- Job objects go into it when ready to submit
- submit failure puts the job back into the ready queue
- each Job has parents, waiting, and children lists (the parents and children lists don't change during the run; the waiting list does) -- these should be documented
- PRE scripts are handled separately from the ready queue, which kind of goofs things up -- it might make more sense for Job objects to go into the ready queue when the PRE script should be run
- maybe explain how things work without PRE/POST scripts, then add in that info