condor_stats, KEEP_POOL_HISTORY, CondorView, viewhist
Wisdom on the use and operation of condor_stats, based on e-mail by Alan De Smet during December 2003 and January 2004.
- condor_stats requires that the collector that you're querying have KEEP_POOL_HISTORY turned on.
- The first field in the output from -resourcequery is a percentage through the requested data set. Thus the first entry will have a value close to 0.0, while the last will be close to 100.0.
- If you want time stamps (instead of percentage of data sets), use "-orgformat" which will present the timestamps in seconds since the Unix epoch. The fields are: Timestamp in seconds, machine name, ":", idle time in seconds, load, and the machine state encoded as a number. The machine state coding is:
1. unclaimed
2. matched
3. claimed
4. pre-empting
5. owner
6. shutdown
7. delete
8. backfill
- Note that the machine state coding is replicated in several locations. In addition to adding a new machine state in src/condor_includes/condor_state.h, the new machine state must also be added to the Collector View server in src/condor_collector.V6/view_server.h and src/condor_collector.V6/view_server.C, and to condor_stats in src/condor_tools/stats.C.
- The -to and -from options measure time from the start of the date. So "-from 11 30 2003 -to 12 1 2003" will show data for the 30th of November.
- The results are actually generated in the collector code, not in condor_stats. In most cases whatever condor_collector sends back is dumped directly to the output. The only exception is -resourcequery without -orgformat; in that one case the output is tweaked (to convert machine states from numbers to strings). Most of the logic is in condor_collector.V6/view_server.C.
- The "CondorView server" shell scripts that generate the HTML pages on the View Server pages were re-written in C a long time ago by a student hourly. I think the source is here: /p/condor/workspaces/jepsen/src_java/condor/condorview/viewNT (see the original email from Todd about it in /p/condor/workspaces/jepsen/src_java/condor/condorview/todd.inst)
- "Query type" consists of one of the following options: -resourcelist, -resourcequery, -resgrouplist, -resgroupquery, -userlist, -userquery, -usergrouplist, -usergroupquery, -ckptlist, -ckptquery.
- You must have one query type specified.
- You can specify only one query type. If multiple queries are specified, only the last one takes effect. (In the future it is likely that condor_stats will exit with an error in this case.)
- The non-"list" options require another argument specifying the query. There doesn't appear to be a way to default to the local machine or the current user. The argument must exactly match the second field in the record. See the -orgformat notes below, or this summary. The examples below are confirmed to work on our pool (which is why I chose them).
- -userquery email_address/submit_machine
- Example: adesmet@cs.wisc.edu/puffin.cs.wisc.edu
- -resourcequery hostname
- Example: p22.cs.wisc.edu
- -resgroupquery Architecture/Operating System or "Total"
- Example: INTEL/LINUX
- Example: Total
- -usergroupquery email_address or "Total"
- Example: adesmet@cs.wisc.edu
- Example: Total
- -ckptquery hostname
- Example: toucan.cs.wisc.edu
- Things that will cause condor_stats to abort with the usage message:
- Failure to specify a query type.
- Failure to pass additional information to arguments that require it. (Most arguments demand this. For example, -resourcequery and -from require an additional argument.)
- A start date prior to the Unix epoch (Midnight UTC, Jan 1, 1970). This would typically be set with -from
- A finish date that is in the future. This would typically be set with -to
- A finish date before the start date.
- All queries have a time range. If not specified, the end time defaults to "now" and the start time defaults to 1 day (86,400 seconds) ago. Thus, "-lastday" is effectively the default time range.
- You can only specify the start time once, and similarly the end time. If multiple times are specified, only the last one takes effect. -from sets the start time. -to sets the end time. The following set both start and end time: -lastday, -lastweek, -lastmonth, -lasthours.
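The time range rules above (defaults, plus the three date problems that trigger the usage message) can be sketched roughly as follows. This is only an illustration in Python of the rules as described in these notes; resolve_time_range is a made-up helper and is not taken from the condor_stats source.

```python
import time

DAY_SECONDS = 86_400  # one day, the default look-back described above


def resolve_time_range(start=None, end=None, now=None):
    """Return (start, end) in seconds since the Unix epoch, or raise ValueError.

    Hypothetical helper: it mirrors the rules in these notes, not the
    actual condor_stats code.
    """
    if now is None:
        now = int(time.time())
    if end is None:
        end = now                      # end time defaults to "now"
    if start is None:
        start = now - DAY_SECONDS      # start time defaults to one day ago
    if start < 0:
        raise ValueError("start time is before the Unix epoch")
    if end > now:
        raise ValueError("end time is in the future")
    if end < start:
        raise ValueError("end time is before the start time")
    return start, end
```

Calling resolve_time_range() with no arguments gives the same range that -lastday is described as producing.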
-orgformat
-orgformat only affects those query types which do not end with "list".
The only difference between -orgformat and the default is the first column. To determine what is in the default, look at the orgformat, remove everything up to and including the first colon, and replace it with the percentage of time. So, for example, the -resourcequery -orgformat might include the line:
1074095821 puffin.cs.wisc.edu : 37590 1.000 3
That's time in seconds since the epoch, machine name, ":", idle time in seconds, load, and machine state as an integer. Going back to the default (removing the orgformat), we get:
79.779999 37590 1.000000 CLAIMED
Everything up to and including the colon has been replaced with the percentage time. (You may also notice that the machine state has been converted from a number to a string. This is a special case in the condor_stats code and shouldn't happen for other queries.)
The -orgformat output for each query type corresponds directly to log files in POOL_HISTORY_DIR on the view collector. You can effectively replicate the query by grepping through the appropriate file. The mappings are as follows:
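As a rough illustration of that rewrite, here is a small Python sketch that turns one -resourcequery -orgformat record into the default layout. It assumes the machine states are numbered 1 through 8 in the order listed earlier (code 3 mapping to CLAIMED matches the sample above; the other upper-case spellings are guesses). The percentage is passed in by the caller, since how the collector computes it is not covered in these notes.

```python
# State codes 1..8 in the order listed earlier. Only 3 -> "CLAIMED" is
# confirmed by the sample output; the other spellings are assumed.
STATE_NAMES = {
    1: "UNCLAIMED", 2: "MATCHED", 3: "CLAIMED", 4: "PRE-EMPTING",
    5: "OWNER", 6: "SHUTDOWN", 7: "DELETE", 8: "BACKFILL",
}


def orgformat_to_default(line, percent):
    """Rewrite one -resourcequery -orgformat record in the default layout.

    Hypothetical helper; `percent` is supplied by the caller.
    """
    _, _, rest = line.partition(" : ")   # drop timestamp, name and colon
    idle, load, state = rest.split()
    name = STATE_NAMES.get(int(state), state)  # unknown codes pass through
    return f"{percent:f} {int(idle)} {float(load):f} {name}"


print(orgformat_to_default(
    "1074095821 puffin.cs.wisc.edu : 37590 1.000 3", 79.779999))
```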
Command | Data file |
-userlist | viewhist.0.* |
-userquery | viewhist.0.* |
-resourcelist | viewhist.1.* |
-resourcequery | viewhist.1.* |
-resgrouplist | viewhist.2.* |
-resgroupquery | viewhist.2.* |
-usergrouplist | viewhist.3.* |
-usergroupquery | viewhist.3.* |
-ckptlist | viewhist.4.* |
-ckptquery | viewhist.4.* |
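If grep is not handy, the same lookup can be sketched in Python. This assumes the files are named viewhist.N.* exactly as in the table above and that the query argument must equal the second field of a record; matching_records and the directory argument are illustrative names, not part of Condor.

```python
import glob
import os

# Query option -> first number in the viewhist file name (from the table above).
VIEWHIST_TYPE = {
    "-userlist": 0, "-userquery": 0,
    "-resourcelist": 1, "-resourcequery": 1,
    "-resgrouplist": 2, "-resgroupquery": 2,
    "-usergrouplist": 3, "-usergroupquery": 3,
    "-ckptlist": 4, "-ckptquery": 4,
}


def matching_records(history_dir, query_option, key):
    """Yield records whose second field equals `key` exactly.

    `history_dir` stands in for the POOL_HISTORY_DIR directory.
    """
    file_type = VIEWHIST_TYPE[query_option]
    pattern = os.path.join(history_dir, f"viewhist.{file_type}.*")
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) > 1 and fields[1] == key:
                    yield line.rstrip("\n")
```

For example, list(matching_records(history_dir, "-resourcequery", "p22.cs.wisc.edu")) would collect every stored sample for that machine across all matching files.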
The second number is the granularity of data. The *.0 file has the highest sampling frequency but covers the shortest period, while the *.2 file has the lowest sampling frequency but covers the longest period. The *.0 file contains samples every 4*POOL_HISTORY_SAMPLING_INTERVAL seconds. The *.1 files contain samples 1/4th as often as the *.0 files, while the *.2 files contain samples 1/4th as often as the *.1 files (or 1/16th as often as the *.0 files).
As a given written sample represents at least 4 samples and as many as 64, the subsamples (taken every POOL_HISTORY_SAMPLING_INTERVAL seconds) are averaged together. So a single entry in a *.0 file is the average of 4 samples, while a single entry in the *.2 file is the average of 64 samples.
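The arithmetic works out to a factor of 4 per granularity level. A quick Python sketch of it follows; granularity_info is a made-up name, and the 60 second interval in the example is just an assumed setting for POOL_HISTORY_SAMPLING_INTERVAL.

```python
def granularity_info(sampling_interval, granularity):
    """Return (seconds between written entries, subsamples averaged per entry).

    `granularity` is 0, 1 or 2, the file suffix discussed above.
    """
    if granularity not in (0, 1, 2):
        raise ValueError("granularity must be 0, 1 or 2")
    subsamples = 4 ** (granularity + 1)          # 4, 16, 64
    return subsamples * sampling_interval, subsamples


# Example with an assumed POOL_HISTORY_SAMPLING_INTERVAL of 60 seconds.
for level in (0, 1, 2):
    seconds, count = granularity_info(60, level)
    print(f"*.{level}: one entry every {seconds} s, average of {count} samples")
```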
File format
This is the format of the various viewhist.*.* files. Because -orgformat returns the same information, this is also the format of -orgformat's output. In the actual output, fields are separated by spaces and records are separated by newlines.
viewhist.0.* / -userquery -orgformat
1071109949 adesmet@cs.wisc.edu/puffin.cs.wisc.edu : 16 0
- Timestamp measured in seconds since the Unix epoch
- user_email_address/submit_machine
- :
- Average JobsRunning as integer
- Average JobsIdle as integer
viewhist.2.* / -resgroupquery -orgformat
1055836559 Total : 55.0 0.8 729.8 0.8 83.8
1055836559 INTEL/LINUX : 43.8 0.8 578.8 0.8 20.0
- Timestamp measured in seconds since the Unix epoch
- machine type (Architecture/Operating System) or "Total" for all machines
- :
- Average Machines reporting unclaimed state as floating point number with one decimal place
- Average Machines reporting matched state as floating point number with one decimal place
- Average Machines reporting claimed state as floating point number with one decimal place
- Average Machines reporting preempting state as floating point number with one decimal place
- Average Machines reporting owner state as floating point number with one decimal place
viewhist.1.* / -resourcequery -orgformat
1074101829 p66.cs.wisc.edu : 30179368 0.130 3
- Timestamp measured in seconds since the Unix epoch
- startd machine name
- :
- Average Keyboard Idle in seconds as integer
- Average Load Average as floating point number with 3 decimal places
- Last Machine State as integer
viewhist.4.* / -ckptquery -orgformat
1057703428 toucan.cs.wisc.edu : 45.379 136.138 1106.393 8196.154
- Timestamp measured in seconds since the Unix epoch
- checkpoint machine name
- :
- Average Bytes Received as floating point number with 3 decimal places
- Average Bytes Sent as floating point number with 3 decimal places
- Average Receive Bandwidth as floating point number with 3 decimal places
- Average Send Bandwidth as floating point number with 3 decimal places
viewhist.3.* / -usergroupquery -orgformat
1072743565 matthew@cs.wisc.edu : 3 22
- Timestamp measured in seconds since the Unix epoch
- user address
- :
- Average Jobs Running as integer
- Average Jobs Idle as integer
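Pulling the five layouts above together, a record can be parsed with a small lookup table keyed by the first number in the file name. The Python sketch below is illustrative only: the dictionary keys (jobs_running, load_avg and so on) are names invented here, not identifiers from the Condor source.

```python
# Field names and converters for the data after the colon, keyed by the
# first number in the viewhist file name. The field names are illustrative.
FIELDS = {
    0: [("jobs_running", int), ("jobs_idle", int)],
    1: [("keyboard_idle", int), ("load_avg", float), ("state", int)],
    2: [("unclaimed", float), ("matched", float), ("claimed", float),
        ("preempting", float), ("owner", float)],
    3: [("jobs_running", int), ("jobs_idle", int)],
    4: [("bytes_received", float), ("bytes_sent", float),
        ("recv_bandwidth", float), ("send_bandwidth", float)],
}


def parse_record(file_type, line):
    """Parse one viewhist / -orgformat record into a dict."""
    parts = line.split()
    if len(parts) < 3 or parts[2] != ":":
        raise ValueError(f"not a viewhist record: {line!r}")
    spec = FIELDS[file_type]
    values = parts[3:]
    if len(values) != len(spec):
        raise ValueError(f"expected {len(spec)} values, got {len(values)}")
    record = {"timestamp": int(parts[0]), "key": parts[1]}
    for (name, convert), raw in zip(spec, values):
        record[name] = convert(raw)
    return record


print(parse_record(1, "1074101829 p66.cs.wisc.edu : 30179368 0.130 3"))
```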
Query Types
Command Line | Query name (QUERY_HIST_*) | Data Name (*Data) | File (viewhist.#) |
-userlist | SUBMITTOR_LIST | Submittor | 0 |
-userquery | SUBMITTOR | Submittor | 0 |
-resourcelist | STARTD_LIST | Startd | 1 |
-resourcequery | STARTD | Startd | 1 |
-resgrouplist | GROUPS_LIST | Groups | 2 |
-resgroupquery | GROUPS | Groups | 2 |
-usergrouplist | SUBMITTORGROUPS_LIST | SubmittorGroups | 3 |
-usergroupquery | SUBMITTORGROUPS | SubmittorGroups | 3 |
-ckptlist | CKPTSRVR_LIST | Ckpt | 4 |
-ckptquery | CKPTSRVR | Ckpt | 4 |
(The File entry is the first number in the viewhist file name. The second number is the archive number used when the logs roll over.)