Network Related Policy for Condor
Introduction
Condor is a software system that implements mechanisms and policies to support High Throughput Computing (HTC) on large collections of distributively owned computing resources. The current Condor system matches submitted jobs to available machines in the Condor pool based on the computing resources each machine offers. However, it does little to integrate or manage the network layer: scheduling decisions are made without considering the underlying network capacity and conditions. As a result, Condor may well match a job that has a large input file to a remote node with little bandwidth. To address this problem, we introduce a network related policy for Condor. We start with some use case examples in which users take knowledge of the network layer into account when they submit jobs.
Use Case Examples
In this section, we present three possible use cases in which the submitted jobs have specific requirements on network conditions. These examples illustrate why Condor needs to incorporate knowledge of the network layer.
- Example 1: The submitted job involves potentially intensive network and data interactions with machines distributed all around the world. The job running on Condor acts as a service provided to its subscribers. For instance, the job could be a face recognition system. Users submit their individual test samples to the system independently. The system processes all incoming queries, analyzes them, and returns a recognition result for each request. A large number of users may generate many queries at the same time. The throughput of the incoming data can be estimated approximately, so the network bandwidth of the remote node running the job must be larger than that estimate; otherwise, users will experience noticeable latency while waiting for results.
- Example 2: The submitted job requires an IPv6 network. It also communicates heavily with the machines in a particular VLAN in the Condor pool, and it has requirements on both inbound and outbound connectivity (a bridged setup is preferred at the network setup stage). For instance, the submitted job is a scientific simulation that reads large amounts of input data from machines within the same VLAN. The experimental data is distributed among all the machines in the VLAN. Therefore, it is better for the new network space that hosts the running job to appear as if it were a physical host on the network.
- Example 3: The purpose of submitting the job is to analyze the characteristics of the network traffic for a certain task. The user is interested in the network load, incoming traffic, and outgoing traffic over a certain period of time. The job also has a bandwidth requirement. For example, the submitted job could be a social network application with many interactions between the server and its users. By observing the network load, we can gain insight into when users are most and least active and when the server is under the heaviest load.
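The throughput estimate mentioned in Example 1 is simple arithmetic, and a minimal sketch may make it concrete. The function name, the `headroom` parameter, and the numbers in the usage lines are illustrative assumptions and do not come from Condor or from any measurement.

```python
def required_bandwidth_mbps(queries_per_second, bytes_per_query, headroom=1.0):
    """Estimate the incoming data throughput in Mbps (megabits per second).

    queries_per_second: expected peak query rate (assumed known to the user)
    bytes_per_query:    average size of one submitted sample, in bytes
    headroom:           safety factor >= 1.0 (hypothetical knob, not a Condor setting)
    """
    if queries_per_second < 0 or bytes_per_query < 0 or headroom < 1.0:
        raise ValueError("rates and sizes must be non-negative, headroom >= 1.0")
    # bytes/s -> bits/s -> megabits/s
    bits_per_second = queries_per_second * bytes_per_query * 8
    return bits_per_second * headroom / 1_000_000


# Illustrative numbers only: 50 queries/s, each carrying a 200 kB sample.
estimate = required_bandwidth_mbps(50, 200_000)
print(f"estimated throughput: {estimate} Mbps")
```

Under these assumed numbers, the node running the job would need more than 80 Mbps of available bandwidth, which is the kind of value a user could then advertise in the job's bandwidth request.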
Job ClassAds
This section describes the Job ClassAd attributes that advertise a user job's network preferences and requirements. The attributes are listed below:
- IPProtocol ---- This attribute indicates which IP protocol the users want to use for their network related jobs. There are two options for this attribute: "IPv4" and "IPv6". The attribute value is a string, that is, a sequence of characters enclosed in double quotes. For instance, IPProtocol = "IPv6".
- RequestBandwidth ---- This attribute indicates the network bandwidth the users require while their submitted jobs are running on the matched machines in the Condor pool. The job can be matched to a machine only if that machine can provide more network bandwidth than the required value during job execution. The attribute value is a real number in units of Mbps; the unit itself is omitted from the value. For example, RequestBandwidth = 5.5 simply means that the required bandwidth is 5.5 Mbps. In practice, RequestMaxBandwidth and RequestMinScheddBandwidth are used.
- NetworkAccounting ---- This attribute indicates whether the users want the Condor starter to invoke the network accounting functionality. The value can be TRUE or FALSE. For example: NetworkAccounting = TRUE. In practice, NetworkLoad, NetworkIn, and NetworkOut are provided.
- NetworkSetup ---- This attribute determines how to set up the network for the purpose of network accounting when the job is running in the Condor pool. The value can be "Bridge" or "NAT". Users can use InboundConnectivity (True/False) and OutboundConnectivity (True/False) to advertise their preference for the network setup.
- PreferVLAN ---- This attribute determines which VLAN the users want their jobs to run in. The value of this attribute is a string. There should be a set of predefined VLAN names that are known to the users. For instance, PreferVLAN = "CMS" means the user wants the job to run on a machine in the CMS network. In this case, only the machines in this specific VLAN can be matched to run the job.
- PreferDomain ---- This attribute indicates the preferred domain name corresponding to the IP address of the machine that runs the users' submitted jobs. For instance, PreferDomain = "hcc.unl.edu" indicates that the user prefers to use machines from the Holland Computing Center; PreferDomain = "cs.uw.edu" indicates that the user prefers to use machines from the CS department of UW.
- SelfDomain ---- This attribute advertises the domain name of the user machine from which jobs are submitted. Rank and requirement expressions can use this attribute to assign different priorities to jobs coming from different sites.
- Latency ---- This attribute indicates the network latency that the submitted job can tolerate. The user prefers to run jobs in a network with latency lower than this value. The value is a real number in seconds. For instance, Latency = 0.05 means the preferred latency is less than 50 ms.
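To show how these attributes could fit together, here is a minimal Python sketch that models a job's network attributes, renders them as `Name = value` lines, and checks the hard requirements against a machine description. It is only an illustration under stated assumptions: it is not Condor's matchmaking code, the class and function names are invented for this example, and the machine-side keys (`ip_protocols`, `bandwidth_mbps`, `vlan`) are hypothetical because the machine attributes are not specified here. Latency is treated as a preference and is left out of the requirement check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobNetworkAd:
    """Network related job attributes, named after the list above."""
    ip_protocol: str = "IPv4"                  # IPProtocol: "IPv4" or "IPv6"
    request_bandwidth: Optional[float] = None  # RequestBandwidth, in Mbps
    network_accounting: bool = False           # NetworkAccounting
    network_setup: Optional[str] = None        # NetworkSetup: "Bridge" or "NAT"
    prefer_vlan: Optional[str] = None          # PreferVLAN
    latency: Optional[float] = None            # Latency, in seconds

    def to_classad_lines(self):
        """Render the attributes that are set as 'Name = value' lines."""
        accounting = "TRUE" if self.network_accounting else "FALSE"
        lines = [f'IPProtocol = "{self.ip_protocol}"']
        if self.request_bandwidth is not None:
            lines.append(f"RequestBandwidth = {self.request_bandwidth}")
        lines.append(f"NetworkAccounting = {accounting}")
        if self.network_setup is not None:
            lines.append(f'NetworkSetup = "{self.network_setup}"')
        if self.prefer_vlan is not None:
            lines.append(f'PreferVLAN = "{self.prefer_vlan}"')
        if self.latency is not None:
            lines.append(f"Latency = {self.latency}")
        return lines


def machine_satisfies(job, machine):
    """Check the job's hard network requirements against a machine.

    `machine` is a plain dict with HYPOTHETICAL keys; real machine
    attributes may be named and structured differently.
    """
    if job.ip_protocol not in machine.get("ip_protocols", ()):
        return False
    # The machine must offer strictly more bandwidth than requested.
    if (job.request_bandwidth is not None
            and machine.get("bandwidth_mbps", 0.0) <= job.request_bandwidth):
        return False
    # Only machines in the requested VLAN can be matched.
    if job.prefer_vlan is not None and machine.get("vlan") != job.prefer_vlan:
        return False
    return True


job = JobNetworkAd(ip_protocol="IPv6", request_bandwidth=5.5,
                   prefer_vlan="CMS", latency=0.05)
machine = {"ip_protocols": ("IPv4", "IPv6"), "bandwidth_mbps": 100.0, "vlan": "CMS"}
print("\n".join(job.to_classad_lines()))
print("match:", machine_satisfies(job, machine))
```

In a real pool these checks would presumably be written as requirement and rank expressions rather than as Python code; the sketch only makes the intended semantics of each attribute explicit.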
Machine ClassAds
Network Related Policy