HTCondorWiki: Gridmanager Log

Reading the Gridmanager Log

Normal Logging Level

This section describes entries that always appear in the gridmanager log.

Like all HTCondor daemons, the gridmanager prints a banner when it starts up.

11/08/11 23:38:56 ******************************************************
11/08/11 23:38:56 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
11/08/11 23:38:56 ** /afs/cs.wisc.edu/unsup/condor-production/condor-7.6.4/i386_rhel5/sbin/condor_gridmanager
11/08/11 23:38:56 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
11/08/11 23:38:56 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
11/08/11 23:38:56 ** $CondorVersion: 7.6.4 Oct 17 2011 BuildID: 378655 PRE-RELEASE-UWCS $
11/08/11 23:38:56 ** $CondorPlatform: x86_rhap_5 $
11/08/11 23:38:56 ** PID = 15864
11/08/11 23:38:56 ** Log last touched 11/8 23:32:33
11/08/11 23:38:56 ******************************************************
11/08/11 23:38:56 Using config source: /p/condor/workspaces/jfrey/test.68/condor_config
11/08/11 23:38:56 Using local config sources:
11/08/11 23:38:56    /p/condor/workspaces/jfrey/test.68/condor_config.local
11/08/11 23:38:56 DaemonCore: command socket at <128.105.165.29:37201>
11/08/11 23:38:56 DaemonCore: private command socket at <128.105.165.29:37201>

This banner includes the HTCondor version and platform, daemon pid, when the log file was last touched, config files, and the daemon's command port.

It's possible for multiple gridmanger processes to be writing to the log at the same time. So after the banner, the gridmanager includes its pid in square brackets on every line.

When the gridmanager starts managing a job (whether newly submitted or released from hold, it writes this to the log:

11/08/11 23:38:59 [15864] Found job 716.0 --- inserting

Lines in the log that relate to a specific job will often include the job id in parentheses:

11/08/11 23:38:59 [15864] (716.0) doEvaluateState called: gmState GM_CLEAR_REQUEST, globusState 32

With GRAM, whenever the jobmanager notifies the gridmanager of a job status change, it's logged:

11/08/11 23:39:10 [15864] (716.0) gram callback: state 1, errorcode 0

When a GAHP command fails, you can usually find a line mentioning it:

11/08/11 23:42:40 [15864] (716.0) gmState GM_RESTART, globusState 128: globus_gram_client_job_request() returned Globus error 93

When the gridmanager exits normally, it prints a goodbye message:

11/08/11 23:42:44 [15864] **** condor_gridmanager (condor_GRIDMANAGER) pid 15864 EXITING WITH STATUS 0

Verbose Logging Level

When something goes wrong, it's useful to increase the verbosity of the gridmanager's logging to collect more information. When you do so, you may also want to increase the maximum log size

GRIDMANAGER_DEBUG = D_FULLDEBUG
MAX_GRIDMANAGER_LOG = 50000000

This section will describe some of the additional information that will appear in the log when D_FULLDEBUG is enabled.

GAHP

All data exchanged with GAHP servers is logged. The gridmanager communicates with a GAHP server over a pair of pipes. The GAHP acknowledges the issuance of each command. The results of the command are supplied later, in response to a RESULTS command.

11/03/11 16:51:54 [29399] GAHP[29509] <- 'GRAM_JOB_SIGNAL 10632 https://stress2.
chtc.wisc.edu:46861/16217854812539030796/8230609966348670593/ 10 NULL'
11/03/11 16:51:54 [29399] GAHP[29509] -> 'S'
...
11/03/11 16:51:54 [29399] GAHP[29509] <- 'RESULTS'
11/03/11 16:51:54 [29399] GAHP[29509] -> 'R'
11/03/11 16:51:54 [29399] GAHP[29509] -> 'S' '1'
11/03/11 16:51:54 [29399] GAHP[29509] -> '10632' '0' '0' '8'

In this example, 29509 is the pid of the GAHP server, GRAM_JOB_SIGNAL is the type of command being issued, 10632 is a unique id for the command, and the rest of the line is arguments for the command. To learn more about the GAHP protocol, see http://www.cs.wisc.edu/condor/gahp

The stderr of the GAHP server is also logged:

09/21/11 15:26:28 [4205] GAHP[4209] (stderr) -> 2011-09-21 15:26:28,909 INFO - CreamProxy_Delegate::execute() - Signing returned proxy with our proxy certificate [/tmp/x509up_u17063]...

Status and attribute changes

The gridmanager logs when it observes the job's status as reported by the remote system change:

11/09/11 00:03:25 [16311] (717.0) globus state change: ACTIVE -> STAGE_OUT

11/09/11 00:17:39 [16741] (718.0) cream state change: IDLE -> REALLY-RUNNING

The gridmanager's connections to its schedd are also logged. This includes every job attribute it modifies.

11/03/11 16:44:47 [29399] in doContactSchedd()
11/03/11 16:44:47 [29399] Updating classad values for 714.933:
11/03/11 16:44:47 [29399]    GlobusStatus = 64
11/03/11 16:44:47 [29399]    GridJobStatus = "STAGE_IN"
11/03/11 16:44:47 [29399]    LastRemoteStatusUpdate = 1320356686
11/03/11 16:44:47 [29399]    NumGlobusSubmits = 1
11/03/11 16:44:47 [29399] leaving doContactSchedd()

State machine

The gridmanager maintains a state machine for each job. Transitions between states are usually triggered by events related to the job. Most states involve some action being taken on the job, with a transition triggered by completion of the action.

Any time something happens that may affect a job, it's state machine is evaluated by calling the function doEvaluateState().

11/03/11 16:46:56 [29399] (714.933) doEvaluateState called: gmState GM_DONE_COMMIT, globusState 8

This may result in one or more transitions in the state machine.

11/03/11 16:46:56 [29399] (714.933) gm state change: GM_DONE_COMMIT -> GM_DELETE

The state machine is different for each scheduler type. However, there is a lot of similarity. Here are the states a GRAM or CREAM job will transition through during a normal job execution.

GM_INIT: Finish initialization of job object
GM_START: Decide whether we have a new job or are recovering a previous job submission
GM_CLEAR_REQUEST: Clear job ad of any attributes relating to a previous job submission, wait for schedd to write changes to disk
GM_UNSUBMITTED: Job ad is clean and ready for a new submission attempt
GM_DELEGATE_PROXY: (CREAM only) Delegate proxy to CREAM, if not done already
GM_SET_LEASE: (CREAM only) Set job lease with CREAM, if not done already
GM_SUBMIT: Start job submission (CE can't start running job yet)
GM_SUBMIT_SAVE: Save GridJobId persistently in schedd
GM_STAGE_IN: (CREAM only) Transfer input files
GM_SUBMIT_COMMIT: Finish job submission (CE is free to start running job)
GM_SUBMITTED: Wait for CE to finish running job
GM_PROBE_JOBMANAGER/GM_POLL_JOB_STATE: Ask CE for current status of job, return to GM_SUBMITTED afterwards
GM_PUT_TO_SLEEP: (GRAM2 only) Kill jobmanager process
GM_JOBMANAGER_ASLEEP: (GRAM2 only) Jobmanager is dead, listening to grid monitor
GM_RESTART: (GRAM2 only) Restart jobmanager, usually because job is done
GM_RESTART_SAVE: (GRAM2 only) Save new jobmanager URL in schedd
GM_RESTART_COMMIT: (GRAM2 only) Tell restarted jobmanager to start managing job
GM_CHECK_OUTPUT: (GRAM only) Verify output is completely transferred
GM_STAGE_OUT: (CREAM only) Transfer output files
GM_DONE_SAVE: Record job as completed in schedd
GM_DONE_COMMIT: Tell CE to clean up job state
GM_DELETE: Delete job object in gridmanager (terminal state)

If an error occurs, the job may end up in one of these states:

GM_CANCEL: Remove job from CE
GM_HOLD: Place job on hold for user or admin to debug