Ticket #2197: General statistics architecture for Condor

We would like to gather statistics in order to improve the Condor software and to provide guidance to users. The data should give us insight into what the daemons are doing, and in particular should help us determine the reasons for badput.

General Guidelines:

We will do the bare minimum of data processing; the onus of processing will be on the user. Thus the data we collect will observe the following constraints.

Calls to increment and update raw statistical values will need to be added in various places in the code that we are trying to measure, but an effort should be made to minimize the amount of code needed outside of the statistics module.

It will be straightforward to change or add statistical values, and to change which ClassAds the statistics are published into.

Implementation:

A module will be created to aid in the storage, updating and publishing of statistics. Storage of temporary values and much of the code for windowing and calculating statistical data will be collected into a class or set of classes. The class will have a simple interface for callers to increment various counters, and to publish the collected values.
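As a rough illustration of the kind of interface described above, here is a minimal sketch of a counter that keeps both a lifetime total and a "recent" total over a fixed number of time slots. The class name RecentCounter, its methods, the std::map stand-in for a ClassAd, and the "Recent" attribute prefix are all assumptions made for this example; they are not taken from generic_stats.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical probe: a lifetime total plus a "recent" total kept in a
// fixed-size ring of time slots. Names are illustrative only.
class RecentCounter {
public:
    explicit RecentCounter(std::size_t slots) : ring_(slots, 0) {
        assert(slots > 0);  // a window needs at least one slot
    }

    // Add to both the lifetime total and the current time slot.
    void Add(std::int64_t n = 1) {
        total_ += n;
        ring_[head_] += n;
    }

    // Move to the next time slot, discarding the oldest one.
    // A daemon would call this from a periodic timer.
    void Advance() {
        head_ = (head_ + 1) % ring_.size();
        ring_[head_] = 0;
    }

    std::int64_t Total() const { return total_; }

    // Sum of all slots currently inside the window.
    std::int64_t Recent() const {
        std::int64_t sum = 0;
        for (std::int64_t v : ring_) sum += v;
        return sum;
    }

    // Stand-in for publishing into a ClassAd: writes <name> and Recent<name>.
    void Publish(std::map<std::string, std::int64_t>& ad,
                 const std::string& name) const {
        ad[name] = Total();
        ad["Recent" + name] = Recent();
    }

private:
    std::vector<std::int64_t> ring_;  // size never changes, so memory is bounded
    std::size_t head_ = 0;
    std::int64_t total_ = 0;
};
```

With this shape, code outside the statistics module only needs a single Add() call at each measurement point; windowing and publishing stay inside the class.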

There will be three main types of statistical values

In most cases statistical values will be available as both Simple values and Recent window values.

The statistics will be collected internally in the daemons, and published in the daemons' ClassAds. The collection will occur in a statically allocated table, with the exception of the binning for the lists (e.g., in JobLifetimes). The binning is envisaged to be a config knob.
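To make the binning idea concrete, the sketch below builds a histogram whose bin limits come from a configuration string. The comma-separated format, the function ParseBinLimits and the class Histogram are assumptions for illustration only; the ticket does not specify the knob's syntax.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse ascending bin limits from a string such as "60,600,3600".
// The format is an assumption; strtoll is used so that bad input
// does not raise a C++ exception.
std::vector<std::int64_t> ParseBinLimits(const std::string& knob) {
    std::vector<std::int64_t> limits;
    std::stringstream in(knob);
    std::string item;
    while (std::getline(in, item, ',')) {
        limits.push_back(std::strtoll(item.c_str(), nullptr, 10));
    }
    assert(std::is_sorted(limits.begin(), limits.end()));
    return limits;
}

// Hypothetical histogram: N limits give N+1 bins.
// Bin i counts values v with limits[i-1] <= v < limits[i];
// the last bin counts everything at or above the final limit.
class Histogram {
public:
    explicit Histogram(std::vector<std::int64_t> limits)
        : limits_(std::move(limits)), counts_(limits_.size() + 1, 0) {}

    void Add(std::int64_t value) {
        // Index of the first limit strictly greater than value.
        std::size_t bin = static_cast<std::size_t>(
            std::upper_bound(limits_.begin(), limits_.end(), value) -
            limits_.begin());
        ++counts_[bin];
    }

    const std::vector<std::int64_t>& Counts() const { return counts_; }

private:
    std::vector<std::int64_t> limits_;  // declared first: counts_ depends on it
    std::vector<std::int64_t> counts_;
};
```

Because the number of bins is only known once the knob has been read, this is the one part of the collection that cannot live in a table sized at compile time.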

The initial implementation will define what statistics exist and where they will be published at compile time. But it is expected that we may want to make changes at run-time in the future, so care will be taken to not preclude that.

Because the changes in #2006 have already been shipped by Red Hat, we will make it easy to publish some of the values under the names indicated in #2006 as well as under the names given above. This behavior will be controlled by a new param called PUBLISH_STATISTICS_USING_OLD_NAMES_ALSO.

Proposed values:

Milestones

  1. 20110909 Implement the generic plumbing, and simple and windowed stats for the schedd.
  2. 20110817 #2354 Statistics in DaemonCore
  3. 20110819 #2353 Statistics histogram
  4. 20110915 #2355 Unit tests for Ring buffer implementation
  5. 20110915 #2356 Condor tests for statistics implementation
  6. 20110830 #2429 create AddingStatisticsValues page to document how to use the classes in generic_stats.h

Remarks:

2011-May-31 14:30:44 by eje:
Performance stats are fairly daemon-specific in my experience. What is the expected payoff for a generic cross-daemon 'stats collecting' object and/or service?

As part of #2006 I implemented a template widget called timed_queue<> which can be limited by either a time-window or a particular size (or both) -- it provides ring-buffer functionality. It also demonstrates some basic accumulation methods:

.../src/condor_utils/timed_queue.h


2011-May-31 14:54:48 by nwp:
Payoff from having a cross-daemon object/service is that it is something that the schedd does not have to do. It seems that the schedd is getting overloaded, and not adding more to it seems like a good idea. Also, such a service could communicate with the collector and negotiator, rather than being part of the schedd.

As it is, the timed_queue has got two objections that I have heard here around UW: first, it is not a priori bounded in memory usage; second, there is an objection to deriving from std::deque, because condor is not set up now to handle C++ exceptions very well.


2011-May-31 17:31:32 by eje:
Regarding timed_queue<> questions:

1) Its memory can be bounded in a well-defined way using the max_len() method, if desired.

2) Regarding exceptions, the STL data structures that are used in the condor code already include string, vector<> and map<>. Adding deque<> to that list doesn't introduce a qualitative change.


2011-May-31 17:39:31 by eje:
Regarding schedd computational load, it doesn't feel obvious to me that updating a given statistic is more compute-expensive than executing the machinery of sending data to a third party so the third party can maintain it.


2011-Jun-01 12:04:26 by nwp:
Yes, STL has snuck into Condor, for better or worse. Hopefully, the upside is better than the downside.

I do not know that the statistics daemon is a good idea; I am just throwing it out there. I could see some lunatic asking for statistics 30 times/sec and then it would probably make sense to offload it from the schedd.


2011-Jun-01 12:18:21 by tstclair:
I think we need to create some clear objectives & requirements, based on use cases, similar to some of the other tickets. It might become more obvious what the best path will be then. As it stands, this is pretty nebulous, and prone to conjecture atm. It might be worth a mtg with some folks to gather/define requirements.


2011-Jun-01 12:19:06 by nwp:
Amen


2011-Jun-03 10:20:39 by bcotton:
From our Research Services group, who do the statistical finagling:

There are three distinct types of things that we cannot (to our knowledge) get from Condor right now that would be very helpful:

1. Full job run attempt history. Right now, all that is logged in the history log is the last, second-to-last, and some info about the first run attempted. A lot of the discussion and proposal seems to be centered around the notion of reporting statistics on a job. We don't want or need Condor to do statistics--we have real statistics packages for that. What we would like is for Condor to log the raw data those need about every attempted run.

2. Information on jobs in progress or in queue. Right now, history only logs anything about a job after it completes. This is very problematic. It would be much more useful to get information about jobs before they complete, as things happen. This really is a fundamental shift in thinking about logging, from a job-based log to an event-based log. Even if something wholly different from the current history log, a real event-based log would be far more useful. Many jobs will stay in progress for days or even weeks, and not being able to report on them until they finish makes truly accurate reporting on the current state of the system simply impossible.

3. Centralized logging. Being reliant on access to logs from the schedd hosts is a really big problem. It means that we cannot get accurate information on what our pool resources are doing if any of them allow jobs from hosts outside of our control, either directly or through flocking. This creates a huge disincentive for anyone to allow any jobs from outside of their own organization, because they have no way of tracking what work or even how much work they are supporting by those outside schedds. Note, this doesn't have to be a fancy DB-based custom system. Simply having event-based remote syslog support would solve all of (1), (2), and (3) above.


2011-Jun-03 10:36:52 by tstclair:
bcotton - love the comments! and there may be 2-3 side-tickets just from that!

But I'm not sure it relates 100% to the root of this specific ticket. For example, most of the stats we are looking to obtain, or that already exist, are internal performance measurements of the code itself. The question(s) we've been trying to figure out are:


2011-Jun-03 11:06:06 by eje:
bcotton, fyi - I think some of what you want will be available via new schedd stats in 7.7 (#2006)


2011-Jun-04 12:16:44 by eje:
see also: #2211 (collector stat rfe)


2011-Jun-06 13:35:56 by gthain:
Note that I'm not worried about using deque, but deriving from std::deque is a bit hinky, as the stl containers aren't really designed for this.


2011-Jun-06 16:04:43 by johnkn:
I propose that the sort of statistics that are relevant to this ticket are.


2011-Jun-06 16:17:37 by psilord:
Statistical data in which I have been personally interested:

  1. Job completion rate per user and total over all users
    • It is useful because it tells me if all of Condor is somehow being affected by some environment problem or just a particular user is being affected.
  2. I/O rate in/out per user and total over all users
    • When a machine is "slow" I can look to see who might be responsible for it since usually it is I/O related. I can see if the current network bandwidth of the machine is a limiting factor to the rate of job completion.
  3. Global reconnect rate of shadows started up by schedd
    • This tells me if the submit machine is in an odd state where the shadows aren't staying active, continuously reconnecting correctly, but then going away again. It is usually indicative of high load related to process churn.


2011-Jun-10 14:28:38 by tstclair:
In doing some recent testing it would be nice to see:

ShadowsRunning

ShadowsRunningCumulative

ShadowsRecycled

ShadowsRecycledCumulative

because exit codes only tell 1/2 the story ;-)



2011-Sep-07 16:24:15 by nwp:
I wonder if we should scope the enumerations in generic_stats.h, e.g.,
+enum {
+   STATS_ENTRY_TYPE_INT32 = 1,
+   STATS_ENTRY_TYPE_INT64 = 2,
+   STATS_ENTRY_TYPE_FLOAT = 1 | 4,
+   STATS_ENTRY_TYPE_DOUBLE = 2 | 4,
+   STATS_ENTRY_TYPE_UNSIGNED = 8,
+   STATS_ENTRY_TYPE_UINT32 = STATS_ENTRY_TYPE_INT32 | STATS_ENTRY_TYPE_UNSIGNED,
+   STATS_ENTRY_TYPE_UINT64 = STATS_ENTRY_TYPE_INT64 | STATS_ENTRY_TYPE_UNSIGNED,
+   };
and such, in namespaces, to avoid collisions
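
For illustration, a namespace-scoped version of the enumeration might look like the sketch below. The namespace name stats_entry_type, the shortened enumerator names and the two helper functions are hypothetical; only the numeric values are copied from the snippet above.

```cpp
#include <cassert>

// Hypothetical scoping of the entry-type flags. Inside the namespace the
// STATS_ENTRY_TYPE_ prefix is no longer needed to avoid collisions.
namespace stats_entry_type {

enum Type {
    Int32    = 1,
    Int64    = 2,
    Float    = 1 | 4,   // the 4 bit marks a floating-point type
    Double   = 2 | 4,
    Unsigned = 8,       // modifier bit, combined with a width below
    UInt32   = Int32 | Unsigned,
    UInt64   = Int64 | Unsigned
};

// Helpers (not in the original snippet) that test the flag bits.
inline bool IsUnsigned(Type t)      { return (t & Unsigned) != 0; }
inline bool IsFloatingPoint(Type t) { return (t & 4) != 0; }

}  // namespace stats_entry_type
```

Callers would then write stats_entry_type::UInt64 rather than STATS_ENTRY_TYPE_UINT64.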


2011-Sep-07 16:27:40 by nwp:
Why are we using pointer-to-member rather than inheritance-plus-virtual functions?


2011-Sep-08 09:43:52 by johnkn:
pointer-to-member allows you to have multiple Publish methods in your class, and pass a pointer to the one you want to use when you register the probe with the pool. You could even register the probe multiple times with a different attrib name and Publish method each time.

So if we need to have backward-compatible names for values, we can do that.
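
A minimal sketch of that registration pattern, assuming a single probe type for brevity: the Counter and Pool classes, their method signatures, the attribute names and the std::map stand-in for a ClassAd are invented for this example and are much simpler than the StatisticsPool in generic_stats.h.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Ad;  // stand-in for a ClassAd

// Hypothetical probe with two different publish methods.
class Counter {
public:
    void Add(long n) { value_ += n; }

    // Normal publishing: just the value.
    void Publish(Ad& ad, const std::string& attr) const {
        ad[attr] = std::to_string(value_);
    }

    // Alternative publishing: a more verbose form of the same data.
    void PublishDebug(Ad& ad, const std::string& attr) const {
        ad[attr] = "Counter(value=" + std::to_string(value_) + ")";
    }

private:
    long value_ = 0;
};

// Hypothetical pool: each registration pairs a probe with an attribute
// name and a pointer to the member function that should publish it.
class Pool {
public:
    typedef void (Counter::*PublishFn)(Ad&, const std::string&) const;

    void Register(const Counter* probe, const std::string& attr, PublishFn fn) {
        entries_.push_back(Entry{probe, attr, fn});
    }

    // Invoke every registered publish method through its member pointer.
    void Publish(Ad& ad) const {
        for (const Entry& e : entries_) {
            (e.probe->*e.fn)(ad, e.attr);
        }
    }

private:
    struct Entry {
        const Counter* probe;
        std::string attr;
        PublishFn fn;
    };
    std::vector<Entry> entries_;
};
```

Registering the same Counter twice with &Counter::Publish under a new and an old attribute name yields both attributes in the ad, which is the backward-compatibility case; a virtual Publish() would give each probe only one publishing behavior.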


2011-Dec-13 09:30:37 by nwp:
The exact format and options in STATISTICS_TO_PUBLISH should be put into the manual, as they are not there now.

Properties:

Type: enhance                  Last Change: 2013-Jan-07 16:14
Status: stalled                Created: 2011-May-27 08:52
Fixed Version: v070702         Broken Version:
Priority:                      Subsystem: Daemons
Assigned To: johnkn            Derived From: #2006
Creator: nwp                   Rust:
Customer Group: other          Visibility: public
Notify: johnkn@cs.wisc.edu, eje@redhat.com, tstclair@redhat.com, epaulson@cs.wisc.edu, tannenba@cs.wisc.edu
Due Date: 20111212

Derived Tickets:

#2211   Publish performance stat attributes for the Collector
#2353   Statistics histogram
#2354   Statistics in DaemonCore
#2355   Unit tests for Ring buffer implementation
#2356   Condor tests for statistics implementation
#2429   create a wiki page showing how to add statistics in the condor code
#2474   Convert ad-hoc statistics in various daemons to use generic_stats
#2732   Create a new condor_reset_stats tool
#2733   Incorporate statistics into the startd
#2783   Add attributes to JobAd to show time spent in file transfer
#2862   Schedd should collect statistics on sub-sets of completed jobs.
#3288   RFE: expose general_stats 'recent' ring buffer quantization to configu
#3428   Create unit tests for ring_buffer code used by daemon statistics
#3443   daemon abort after changing STATISTICS_WINDOW_SECONDS, reconfig

Related Check-ins:

2013-Jan-15 09:31   Check-in [34639]: fix statistics ring buffer so that it no longer aborts after a SetSize operation that reduces the size. #3443. This should fix the ring buffer problem in mentioned in #3428 also. parent ticket #2197 ===VersionHistory:Pending=== (By John (TJ) Knoeller )
2012-May-09 10:14   Check-in [31971]: fix schedd stats code merged from 7.8 to add 2nd argument needed for 7.9 #2197 ===VersionHistory:None=== (By John (TJ) Knoeller )
2012-May-08 15:30   Check-in [31954]: fix for Recent values not updating for SCHEDD_COLLECT_STATS_FOR_xxx type stats in the schedd. #2197 ===VersionHistory:Pending=== (By John (TJ) Knoeller )
2012-Feb-28 10:05   Check-in [30729]: fix basic file transfer statistics #2197 to account for jobs failing so early that they never begin to execute. ===VersionHistory:None=== (By John (TJ) Knoeller )
2012-Jan-25 17:09   Check-in [29218]: Add basic schedd statistics for time spent in file transfer vs. time spent executing. for #2197 new attributes are JobsAccumPreExcuteTime, JobsAccumExecuteTime, JobsAccumPostExecuteTime and Recent flavors ===VersionHistory:Pending=== (By John (TJ) Knoeller )
2011-Dec-14 14:46   Check-in [28776]: specified -statistics command line option for condor_status ===GT=== #2197 (By Karen Miller )
2011-Dec-14 12:29   Check-in [28772]: add definitions of new config knobs STATISTICS_TO_PUBLISH and STATISTICS_WINDOW_SECONDS ===GT=== #2197 (By Karen Miller )
2011-Dec-13 13:54   Check-in [28743]: addition of schedd statistics attribute definitions related to histograms ===GT=== #2197 (By Karen Miller )
2011-Dec-12 15:01   Check-in [28742]: more changes of schedd statistics ClassAd attribute definitions ===GT=== #2197 (By Karen Miller )
2011-Dec-09 15:25   Check-in [28700]: more changes to defns of schedd statistics attributes ===GT=== #2197 (By Karen Miller )
2011-Dec-08 12:30   Check-in [28672]: further edits to new schedd statistics attribute definitions ===GT=== #2197 (By Karen Miller )
2011-Dec-06 16:07   Check-in [28638]: Remove 3 statistics attributes that are no longer. ===GT=== #2197 (By Karen Miller )
2011-Dec-06 15:39   Check-in [28636]: Identify those attributes which are in the Scheduler ClassAd when the verbosity level of STATISTICS_TO_PUBLISH is high enough. ===GT=== #2197 (By Karen Miller )
2011-Dec-06 10:09   Check-in [28619]: Documentation for verbose scheduler statistics #2197 ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Dec-06 09:47   Check-in [28617]: Documentation for basic scheduler statistics #2197 attributes ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Sep-30 10:58   Check-in [27542]: Add support for -statistics verbosity override to the condor_status -direct command handler in the startd. Also add some verbose mode statistics for startd ResMgr. ===GT=== #2197 ===VersionHistory:Pending=== (By John (TJ) Knoeller )
2011-Sep-30 10:47   Check-in [27540]: Add Publish with config method to schedd stats and daemon core stats #2197 so that a caller can request to override the publish statistics flags defined in the config file. This is added to enable condor_status -direct -statistics or a new condor_stats command. ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Sep-22 09:16   Check-in [27326]: Add Publish with config method to schedd stats and daemon core stats #2197 so that a caller can request to override the publish statistics flags defined in the config file. This is added to enable condor_status -direct -statistics or a new condor_stats command. ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Sep-16 14:30   Check-in [27260]: Fix daemon core statistics publishing in schedd so that dc statistics are not published in the submitter ad. ===GT=== #2197 ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Sep-15 15:17   Check-in [27233]: Add params to control generic statistics publishing and recent window STATISTICS_TO_PUBLISH parameter for generic statistics #2197 and STATISTICS_WINDOW_SECONDS parameter to control the window size for generic statistics DCSTATISTICS_WINDOW_SECONDS parameter to control the window size for daemon core [...] (By John (TJ) Knoeller )
2011-Sep-13 13:31   Check-in [27232]: disable PublishDebug for DC Statistics (#2354) and change default window size for schedd statistics (#2197) from 5 min to 20 min. ===VersionHistory:None=== code has not yet shipped (By John (TJ) Knoeller )
2011-Sep-07 14:15   Check-in [27116]: Created generic StatisticsPool class (#2197) and modifed schedd_stats and daemoncore stats (#2354) to use it. removed code made obsolete as a result. [...] (By John (TJ) Knoeller )
2011-Sep-07 12:25   Check-in [27113]: remove dead code for static publishing tables from generic statistics #2197 move experimental Probe class from daemon core to generic statistics #2354 ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Sep-06 17:24   Check-in [27107]: prune code from generic_stats that is no longer used. #2197 change the name of the stats_pool class to StatisticsPool. ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Aug-31 18:55   Check-in [27048]: change schedd statistics to use a stats_pool rather than static const array #2197 prune some of the generic statistics code that is no longer used. some improvements to stats_pool class to support schedd stats ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Aug-31 16:27   Check-in [27047]: remove dead code from generic stats #2197 and Daemon Core stats #2354 ===VersionHistory:None=== (By John (TJ) Knoeller )
2011-Aug-31 12:50   Check-in [27042]: implement Remove method of stats_pool #2197 (By John (TJ) Knoeller )
2011-Aug-30 20:01   Check-in [27036]: create a generic statistics pool class #2197 and refactor DC statistics to use it. #2354 (By John (TJ) Knoeller )
2011-Aug-26 14:14   Check-in [27034]: change from flags to an enum for statistics entry classes, #2197, also convert Schedd statistics to using the newer publishing structures. this checking changes the effective names of some of the SCHEDD statistics entries. ===VersionHistory:Pending=== (By John (TJ) Knoeller )
2011-Aug-24 16:53   Check-in [26964]: Add pool of dynamic named statistics entries to DaemonCore, #2354 also refactor generic statistics (#2197) to allow greater flexibilty and type safety in support of dynamic named statistics. ===VersionHistory:Pending=== (By John (TJ) Knoeller )
2011-Aug-15 14:44   Check-in [26775]: move version history for statistics (#2197) from 7.7.1 section to 7.7.2 where it belongs. ===VersionHistory=== none (By John (TJ) Knoeller )
2011-Aug-15 14:35   Check-in [26773]: added preliminary version history for schedd statistics. ===GT=== #2197 ===Doc=== Pending (By John (TJ) Knoeller )
2011-Aug-04 12:52   Check-in [26624]: Initial changes for generic statistics architecture #2197 (By John (TJ) Knoeller )

Attachments: