Notes re Miron and Igor's Streaming CE Concerns

[from Miron's visit to San Diego March 2012]

Before going into the streaming CE details, let me just point out that

  1. The vanilla universe is already doing "streaming"!
  2. Jobs can restart multiple times, due to preemption. Condor just does not make restarts a first-class citizen.
  3. Most of what we describe below would actually make sense in the vanilla universe as well.

Miron's main concern is how we handle edge cases; everything else is easy. The two main problems we spotted:

  1. What to do if the client goes away?
    • - We of course have a lease. But one may not be enough.
    • - We want the classic lease with the semantics of "If you don't hear from me before it expires, take away from me all the resources and clean up after me"
    • - However, that lease must be long, to prevent resources from being de-provisioned during short client outages (say, 1 or 2 days)
    • - If the client is down, however, provisioning new resources may not work either (i.e. jobs fail). For example, the client may run a Web server that is used during job startup and is now down (glideinWMS case).
    • - So we want at least another lease, with the semantics "If you don't hear from me before it expires, do not provision new resources (but do keep the existing ones)". This one can be reasonably short lived (tens of minutes to hours); a sketch of the two-lease idea follows this list.
    • - Anything else?
    • - PS: The client should provide the desired lease time(s), but the server must be able to shorten them, if so desired (-> notify client)
  2. What to do if all/most jobs finish at the same time and all restarts fail?
    • - Unmanaged, this would put a high load on the site batch system!
    • - Please notice this is currently not a problem for the glideinWMS factory, because of the "slowness" of the CE
    • - Putting limits on restarts of a single job may not be enough.
    • - One job out of 1k restarting once every 10 mins is OK (e.g. broken WN).
    • - 1k jobs all restarting every 10 mins is not OK (over 1Hz)
    • - We need to correlate the various requests as much as possible, and have aggregate limits (see the throttling sketch after this list)
    • - This goes against the idea of "one Condor-G job per request"
    • - I have no obvious solution right now
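
To make the two-lease idea from problem 1 concrete, here is a minimal Python sketch. Everything in it (class name, TTL defaults, method names) is hypothetical; it is not an existing Condor interface, just a way to pin down the semantics.

    import time

    class StreamingRequestLeases:
        """Two independent leases for one streaming request (hypothetical sketch).

        - keepalive lease: if it expires, de-provision ALL resources and
          clean up after the client; must be long (say 1 or 2 days) so a
          short client outage does not tear everything down.
        - provisioning lease: if it expires, stop provisioning NEW resources
          but keep the existing ones; can be short (tens of minutes to hours).
        """

        def __init__(self, keepalive_ttl=2 * 24 * 3600, provisioning_ttl=3600):
            # The client asks for these TTLs; the server may shorten them
            # (and must then notify the client).
            self.keepalive_ttl = keepalive_ttl
            self.provisioning_ttl = provisioning_ttl
            self.renew()

        def renew(self, now=None):
            """Called every time the server hears from the client."""
            now = time.time() if now is None else now
            self.keepalive_expires = now + self.keepalive_ttl
            self.provisioning_expires = now + self.provisioning_ttl

        def may_provision_new(self, now=None):
            now = time.time() if now is None else now
            return now < self.provisioning_expires

        def must_clean_up(self, now=None):
            now = time.time() if now is None else now
            return now >= self.keepalive_expires

The server would call renew() on every client contact, check may_provision_new() before starting new resources, and tear everything down once must_clean_up() returns true.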
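
And a rough sketch of the aggregate throttling from problem 2, done as a token bucket shared across all jobs of a request group plus a per-job minimum interval. The class, the rate numbers, and the job-id keying are illustrative only, not a proposed API.

    import time
    from collections import defaultdict

    class RestartThrottle:
        """Rate-limit restarts per job AND across the whole request group.

        Illustrative only; the numbers are the examples from the notes
        (1 restart per 10 minutes per job, at most ~1 restart/s overall).
        """

        def __init__(self, per_job_interval=600.0, group_rate_hz=1.0):
            self.per_job_interval = per_job_interval  # seconds between restarts of one job
            self.group_rate_hz = group_rate_hz        # aggregate restarts per second
            self.last_restart = defaultdict(float)    # job_id -> last restart time
            self.group_tokens = group_rate_hz         # simple token bucket (burst of ~1)
            self.last_refill = time.time()

        def _refill(self, now):
            elapsed = now - self.last_refill
            self.group_tokens = min(self.group_rate_hz,
                                    self.group_tokens + elapsed * self.group_rate_hz)
            self.last_refill = now

        def allow_restart(self, job_id, now=None):
            now = time.time() if now is None else now
            self._refill(now)
            if now - self.last_restart[job_id] < self.per_job_interval:
                return False          # this job restarted too recently
            if self.group_tokens < 1.0:
                return False          # the whole group is restarting too fast
            self.group_tokens -= 1.0
            self.last_restart[job_id] = now
            return True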

Then there are "easier" problems:

  1. Do we really want to "fly blind" and treat all "resource provisioning job" terminations as equal?
    • - We should probably natively distinguish at least
      • + Failed at initialization (/validation/...)
      • + Failed, but after doing some useful work
      • + Was never claimed
      • + Right now no work for me (but did some useful work before)
      • + No problems (just being nice to the site so it can handle fair share)
      • + Anything else (what just happened here????)
    • - Would need to standardize how we convey this (one possible encoding is sketched after the end of this list)
      • + Is exit code the right thing?
      • + Just a first approximation (e.g. No problem/problem)? And have a different mechanism for the details?
      • + Should we go even more fancy, and have actual complex policies? (I am not advocating it for "Condor-G streaming", but it came up in passing and may make sense in the generic vanilla universe)
    • - Once we know the above, we should probably throttle restarts of anything but "No problems"
  2. Restart limits (related to above)
    • - Even for "No problems", we want the client to provide a max restart rate (e.g. no more than 1 per hour)
      • + To deal with bugs in the pilot code itself
    • - But we want the client to provide the limits for the other use cases as well
      • + Should we have one for each of the above categories?
      • + Or just one "problem limit"?
      • + What is a reasonable limit? e.g. No more than 1 every 20 mins? (this is fine for "broken node" handling, but maybe not for e.g. "Was never claimed")
    • - And let's not forget the "group throttling" described above
    • - PS: The client should provide the desired limits, but the server must be able to change them, if so desired (-> notify client); a per-category sketch follows this list
  3. What about getting back the output sandboxes?
    • - Regarding reliability
      • - Igor has always insisted on getting all the output sandboxes back, because of the valuable monitoring information in them. After some grilling by Miron, Igor had to admit this is not completely true; what Igor really wants is to get "most of them" back. I.e. he can live with a small loss of output sandboxes (say 1%), as long as it is truly random, but wants to avoid losing most of the outputs for certain events (of his jobs), even if they are relatively rare.
    • - The obvious problem is what to do when the server runs out of space (or quota)
      • - Should it stop starting new jobs, or should it throw away the oldest sandboxes?
      • - The client should have a say about this.
    • - However, it may get more interesting than this
      • - Do we want to prioritize deletion based on the "termination mode"?
      • - e.g. Maybe we are willing to lose a fraction of "Was never claimed" sandboxes, but really want all "No problem" ones back. Or maybe our tolerance for the fraction we can lose differs per kind (a retention sketch follows this list).
    • - BTW: The client may have a similar problem on its own end, but I am not sure this is relevant in this context.
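
One way to make the "termination mode" idea from point 1 concrete is a small enumeration plus a first-approximation exit-code mapping, as in the Python sketch below. The categories mirror the list above; the numeric codes are invented for illustration, not an agreed convention.

    from enum import Enum

    class TerminationMode(Enum):
        """Termination categories from the list above; codes are made up."""
        NO_PROBLEMS = 0        # just being nice to the site (fair share)
        FAILED_AT_INIT = 1     # failed at initialization / validation
        FAILED_AFTER_WORK = 2  # failed, but after doing some useful work
        NEVER_CLAIMED = 3      # was never claimed
        NO_WORK_NOW = 4        # no work right now, but did useful work before
        UNKNOWN = 99           # anything else (what just happened here?)

    def classify_exit(exit_code):
        """First-approximation mapping from a pilot exit code to a category."""
        try:
            return TerminationMode(exit_code)
        except ValueError:
            return TerminationMode.UNKNOWN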
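
Building on that enumeration, the per-category restart limits from point 2 could be a client-provided table that the server is free to tighten (and then notify the client). The intervals below mix the example numbers from the notes (1 per hour, 1 every 20 minutes) with placeholders.

    # Client-requested minimum interval (seconds) between restarts, per category;
    # the server may substitute stricter values and should notify the client.
    DEFAULT_RESTART_INTERVALS = {
        TerminationMode.NO_PROBLEMS:       3600,  # "no more than 1 per hour"
        TerminationMode.FAILED_AT_INIT:    1200,  # e.g. 1 every 20 minutes
        TerminationMode.FAILED_AFTER_WORK: 1200,
        TerminationMode.NEVER_CLAIMED:     7200,  # placeholder: back off harder
        TerminationMode.NO_WORK_NOW:       3600,
        TerminationMode.UNKNOWN:           7200,
    }

    def next_restart_allowed(last_restart_time, mode,
                             intervals=DEFAULT_RESTART_INTERVALS):
        """Earliest time at which a job that ended with `mode` may restart."""
        return last_restart_time + intervals.get(mode, 7200)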
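
And for the out-of-space question in point 3, one possible policy is to rank sandboxes by termination mode (and, within a mode, by age) and delete the most expendable first. The ranking and the tuple layout below are just one reading of the discussion, not a worked-out design.

    # Lower rank = more expendable when the server runs out of space or quota.
    RETENTION_PRIORITY = {
        TerminationMode.NEVER_CLAIMED:     0,  # can afford to lose a fraction of these
        TerminationMode.UNKNOWN:           1,
        TerminationMode.FAILED_AT_INIT:    2,
        TerminationMode.NO_WORK_NOW:       3,
        TerminationMode.FAILED_AFTER_WORK: 4,
        TerminationMode.NO_PROBLEMS:       5,  # really want all of these back
    }

    def pick_sandboxes_to_delete(sandboxes, bytes_needed):
        """Choose sandboxes to delete, most expendable and oldest first.

        `sandboxes` is an iterable of (mode, mtime, size, path) tuples,
        a hypothetical bookkeeping format used only for this sketch.
        """
        ordered = sorted(sandboxes,
                         key=lambda s: (RETENTION_PRIORITY.get(s[0], 0), s[1]))
        victims, freed = [], 0
        for mode, mtime, size, path in ordered:
            if freed >= bytes_needed:
                break
            victims.append(path)
            freed += size
        return victims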

Finally, Miron wanted to see all the above discussed/digested in "client land", i.e. how we express all of the above, before even attempting to go into how we express this in an RPC-like protocol.