condor_stats, KEEP_POOL_HISTORY, CondorView, viewhist

Wisdom on the use and operation of condor_stats, based on e-mail by Alan De Smet during December 2003 and January 2004.

      (see the original email from Todd about it in /p/condor/workspaces/jepsen/src_java/condor/condorview/todd.inst)

-orgformat

-orgformat only affects those query types which do not end with "list". The only difference between -orgformat and the default is the first column. To derive the default output from the -orgformat output, remove everything up to and including the first colon and replace it with the percentage of time. So, for example, the output of -resourcequery -orgformat might include the line:
  1074095821      puffin.cs.wisc.edu      :       37590     1.000 3

That's time in seconds since the epoch, machine name, ":", idle time in seconds, load, and machine state as an integer. Going back to the default (removing the orgformat), we get:

  79.779999       37590   1.000000        CLAIMED

Everything up to and including the colon has been replaced with the percentage of time. (You may also notice that the machine state has been converted from a number to a string. This is a special case in the condor_stats code and shouldn't happen for other queries.)
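
The split between the prefix and the data columns can be sketched in Python. This is only an illustration, not condor_stats code: the helper name split_orgformat is hypothetical, and it assumes whitespace-separated fields with a standalone ":" field, as in the sample line above. It does not compute the percentage column, since that depends on the time range of the query.

```python
def split_orgformat(line):
    """Split one -orgformat record into (timestamp, name, values).

    Assumption: fields are whitespace-separated and a standalone ":"
    field separates the prefix (epoch time, name) from the data columns.
    """
    fields = line.split()
    colon = fields.index(":")        # raises ValueError if there is no ":" field
    timestamp = int(fields[0])       # seconds since the epoch
    name = " ".join(fields[1:colon])
    values = fields[colon + 1:]      # the columns the default format keeps
    return timestamp, name, values


record = "  1074095821      puffin.cs.wisc.edu      :       37590     1.000 3"
print(split_orgformat(record))
# prints (1074095821, 'puffin.cs.wisc.edu', ['37590', '1.000', '3'])
```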

The -orgformat output of the various query types corresponds directly to log files in POOL_HISTORY_DIR on the view collector. You can effectively replicate a query by grepping through the appropriate file. The mappings are:

Command           Data file
-userlist         viewhist.0.*
-userquery        viewhist.0.*
-resourcelist     viewhist.1.*
-resourcequery    viewhist.1.*
-resgrouplist     viewhist.2.*
-resgroupquery    viewhist.2.*
-usergrouplist    viewhist.3.*
-usergroupquery   viewhist.3.*
-ckptlist         viewhist.4.*
-ckptquery        viewhist.4.*
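
As a rough sketch of "grepping through the appropriate file", the table above can be turned into a lookup in Python. The function name grep_history and the dictionary name are hypothetical, and the file naming (viewhist.N.*) is taken from this page as written; check the actual names in your POOL_HISTORY_DIR before relying on it.

```python
import glob
import os

# Command-line option -> data file prefix, copied from the table above.
DATA_FILE_PREFIX = {
    "-userlist": "viewhist.0",
    "-userquery": "viewhist.0",
    "-resourcelist": "viewhist.1",
    "-resourcequery": "viewhist.1",
    "-resgrouplist": "viewhist.2",
    "-resgroupquery": "viewhist.2",
    "-usergrouplist": "viewhist.3",
    "-usergroupquery": "viewhist.3",
    "-ckptlist": "viewhist.4",
    "-ckptquery": "viewhist.4",
}


def grep_history(history_dir, option, needle):
    """Return every line containing `needle` in the files backing `option`."""
    pattern = os.path.join(history_dir, DATA_FILE_PREFIX[option] + ".*")
    matches = []
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            for line in handle:
                if needle in line:
                    matches.append(line.rstrip("\n"))
    return matches
```

For example, grep_history(history_dir, "-resourcequery", "puffin.cs.wisc.edu"), with history_dir set to the value of POOL_HISTORY_DIR, would return the raw records for that machine, which should resemble the -resourcequery -orgformat output.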

The second number in the file name is the granularity of the data. The *.0 file has the highest sampling frequency but covers the shortest period, while the *.2 file has the lowest sampling frequency but covers the longest period. The *.0 file contains samples every 4*POOL_HISTORY_SAMPLING_INTERVAL seconds. The *.1 files contain samples 1/4th as often as the *.0 files, while the *.2 files contain samples 1/4th as often as the *.1 files (or 1/16th as often as the *.0 files).

As a given written sample represents at least 4 samples and as many as 64, the sub samples (taken every POOL_HISTORY_SAMPLING_INTERVAL seconds) are averaged together. So a single entry in a *.0 file is the average of 4 samples, while a single entry in the *.2 file is the average of 64 samples.
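
The arithmetic above can be written out as a small Python sketch. The function name is hypothetical, and the 60-second interval in the example is only an illustrative value for POOL_HISTORY_SAMPLING_INTERVAL, not a claim about your pool's setting.

```python
def written_sample_stats(level, sampling_interval):
    """Return (seconds between written samples, sub-samples averaged).

    `level` is the granularity number of the file (0, 1 or 2) and
    `sampling_interval` is POOL_HISTORY_SAMPLING_INTERVAL in seconds.
    """
    if level not in (0, 1, 2):
        raise ValueError("granularity level must be 0, 1 or 2")
    subsamples = 4 ** (level + 1)    # 4, 16, 64
    return subsamples * sampling_interval, subsamples


for level in (0, 1, 2):
    seconds, count = written_sample_stats(level, 60)
    print(f"*.{level}: one entry every {seconds} s, average of {count} samples")
# prints:
# *.0: one entry every 240 s, average of 4 samples
# *.1: one entry every 960 s, average of 16 samples
# *.2: one entry every 3840 s, average of 64 samples
```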

File format

This is the format of the various viewhist.*.* files. Because -orgformat returns the same information, this is also the format of -orgformat's output. In the actual output, fields are separated by spaces and records are separated by newlines.

viewhist.0.* / -userquery -orgformat

1071109949      adesmet@cs.wisc.edu/puffin.cs.wisc.edu      :       16      0

viewhist.2.* / -resgroupquery -orgformat

1055836559      Total   :       55.0    0.8     729.8   0.8     83.8
1055836559      INTEL/LINUX     :       43.8    0.8     578.8   0.8     20.0

viewhist.1.* / -resourcequery -orgformat

1074101829      p66.cs.wisc.edu :       30179368          0.130 3

viewhist.4.* / -ckptquery -orgformat

1057703428      toucan.cs.wisc.edu      :       45.379  136.138 1106.393        8196.154

viewhist.3.* / -usergroupquery -orgformat

1072743565      matthew@cs.wisc.edu     :       3       22

Query Types

Command           Query name             Data Name         File
Line              QUERY_HIST_*           *Data             viewhist.#
-userlist         SUBMITTOR_LIST         Submittor         0
-userquery        SUBMITTOR              Submittor         0
-resourcelist     STARTD_LIST            Startd            1
-resourcequery    STARTD                 Startd            1
-resgrouplist     GROUPS_LIST            Groups            2
-resgroupquery    GROUPS                 Groups            2
-usergrouplist    SUBMITTORGROUPS_LIST   SubmittorGroups   3
-usergroupquery   SUBMITTORGROUPS        SubmittorGroups   3
-ckptlist         CKPTSRVR_LIST          Ckpt              4
-ckptquery        CKPTSRVR               Ckpt              4

(The viewhist.# entry in the File column is the first number in the file name. The second number is the archive number used when the logs roll over.)