Linux Scalability

Doing large scale grid work, we regularly press against various limits of Linux and other systems. If you're pushing limits such as open file descriptors and network sockets, here is how to ensure that those limits are large enough.

At several points I suggest changing the Linux kernel's configuration by echoing data into the /proc filesystem. These changes are transient; the system will reset to the default values on a reboot. As a result, you'll want to place these changes somewhere where they will be automatically reapplied at boot. On many Linux systems you can use the /etc/rc.d/rc.local script for this. Depending on your particular configuration, you might instead be able to use /etc/sysctl.conf, although you'll need to check the sysctl documentation for the correct format.
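
For example, on a system that reads /etc/sysctl.conf at boot, you can apply the settings in that file immediately, without rebooting, by running the sysctl program as root (the exact invocation may vary slightly between distributions):

    # reapply every setting listed in /etc/sysctl.conf without a reboot
    sysctl -p /etc/sysctl.conf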

System-wide file descriptors

This is the number of concurrently open file descriptors throughout the system. It defaults to 8192. We've found it important to increase this for Globus head nodes and Condor-G submit nodes.

To see the current limit:

    cat /proc/sys/fs/file-max

To set the limit, as root run the following, replacing 32768 with your desired limit.

    echo 32768 > /proc/sys/fs/file-max

To have that automatically set, put the above line in your /etc/rc.d/rc.local (or the equivalent for your distribution).

On Linux 2.2.x based systems, you'll also want to consider inode-max; on Linux 2.4.x this is no longer necessary. You adjust it similarly to file-max, but through /proc/sys/fs/inode-max. The Linux documentation suggests that inode-max be 4 or 5 times the file-max. Again, you'll want to add it to your rc.local or equivalent boot script.

    echo 131072 > /proc/sys/fs/inode-max

On some systems (this appears to be the case for all recent Red Hat distributions), you can instead add a line to your /etc/sysctl.conf. You would do this instead of the above commands. This will be the more natural solution for such systems, but you'll need to reboot the system or use the sysctl program for it to take effect. You need to append the following to your /etc/sysctl.conf:

    # increase system file descriptor limit
    fs.file-max = 32768

To verify that the values are correctly set, you can use the following command:

    cat /proc/sys/fs/file-max
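
If you want to see how close you are to the limit, one additional check (not strictly necessary, but useful) is /proc/sys/fs/file-nr, which reports the number of allocated file handles, the number of allocated-but-unused handles, and the maximum:

    # allocated handles, unused handles, and the system-wide maximum
    cat /proc/sys/fs/file-nr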

Per-process file descriptor limits

In addition to the system-wide limit, each of a user's processes has its own file descriptor limit. The soft limit defaults to 1024; unfortunately, users can only raise it as far as the hard limit, which is also 1024 by default, and only root can increase the hard limit. We've found it important to increase this when starting the condor_master for Condor-G.

You can check the limit with:

    ulimit -n # sh
    limit descriptors # csh

You may be able to give each user a larger limit in the /etc/security/limits.conf file. This only applies to Condor daemons started as the user in question; at the moment (October 2003) Condor ignores these limits when run as root, although it will likely honor them in a future release. Something like the following will increase the file limit for the user john:

    john hard nofile 16384
    john soft nofile 16384

One way to increase it is to become root, increase the limit, switch back to the user, and start the program you want it increased for:

    su - root
    ulimit -n 16384 # if root uses sh
    limit descriptors 16384 # if root uses csh
    su - your_user_name
    program_to_run

You can instead increase the limit for a user permanently. /etc/security/limits.conf contains the limits applied by the login programs. Be sure to set both the hard and soft limits for nofile (number of open files).
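
After logging in again you can verify that both limits took effect; assuming a Bourne-style shell, the commands are:

    ulimit -Sn  # soft limit for open files
    ulimit -Hn  # hard limit for open files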

Condor daemons that run as root can increase their own file descriptor limit. To configure a higher limit, use MAX_FILE_DESCRIPTORS or <subsys>_MAX_FILE_DESCRIPTORS in the condor configuration. Note that using a file descriptor limit that is much larger than needed can add overhead that reduces performance.
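
For example, a sketch of raising the limit for the schedd alone via the <subsys> form might look like the following; the value 20000 is only an illustration, so pick one that matches your workload:

    # raise the file descriptor limit for the condor_schedd only
    SCHEDD_MAX_FILE_DESCRIPTORS = 20000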

Process Identifiers

By default Linux wraps process identifiers around when they exceed 32768. Increasing the interval between these rollovers eases process identification and process management within Condor (e.g., it helps prevent fork() failures). You can find the current maximum PID with:

    cat /proc/sys/kernel/pid_max

You can set the value higher (up to 2^22, i.e. 4,194,304, on 64-bit machines; 32-bit kernels are limited to 32768) with:

    echo 4194303 > /proc/sys/kernel/pid_max

WARNING: Increasing this limit may break programs that use the old-style SysV IPC interface and are linked against pre-2.2 glibc, although it is unlikely that you will run any such programs.

Again, you'll want to add it to your rc.local or equivalent boot script to ensure this is reapplied on every boot.

On some systems (this appears to be the case for all recent Red Hat distributions), you can instead add a line to your /etc/sysctl.conf. You would do this instead of the above commands. This will be the more natural solution for such systems, but you'll need to reboot the system or use the sysctl program for it to take effect. You need to append the following to your /etc/sysctl.conf:

    #Allow for more PIDs (to reduce rollover problems); may break some programs
    kernel.pid_max = 4194303

To verify that the values are correctly set, you can use the following command:

    cat /proc/sys/kernel/pid_max

Local port range

If your system is opening lots of outgoing connections, you might run out of local ports. If a program doesn't bind a specific source port, the kernel picks one from a limited range. By default that range is 1024 to 4999, allowing only 3976 simultaneous outgoing connections. We've found it important to increase this for Condor-G submit nodes. You can find the current range with:

    cat /proc/sys/net/ipv4/ip_local_port_range

You can set the range with:

    echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range

Again, you'll want to add it to your rc.local or equivalent boot script to ensure this is reapplied on every boot.

On some systems (this appears to be the case for all recent Red Hat distributions), you can instead add a line to your /etc/sysctl.conf. You would do this instead of the above commands. This will be the more natural solution for such systems, but you'll need to reboot the system or use the sysctl program for it to take effect. You need to append the following to your /etc/sysctl.conf:

    # increase system IP port limits
    net.ipv4.ip_local_port_range = 1024 65535

To verify that the values are correctly set, you can use the following command:

    cat /proc/sys/net/ipv4/ip_local_port_range
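
To get a rough idea of how many local ports are currently in use, you can count TCP connections; this is only an approximation, since sockets in states such as TIME_WAIT also hold ports:

    # rough count of TCP connections holding local ports
    # (netstat prints two header lines, so subtract 2)
    netstat -tn | wc -l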

Partitions for the Job Queue

Carefully consider which partition is used for the job_queue.log file. This file is a transaction log for the HTCondor schedd; as such, it is frequently fsync'd to disk.

This can be problematic if job outputs and the transaction log are on the same partition. The default journaling mode (data=ordered) on ext3 and ext4 causes all outstanding data on the partition, which may include very large job outputs, to be flushed to disk during an fsync of the transaction log. This can add significant latency before the fsync completes, slowing the schedd and making it unresponsive.
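
To check whether the transaction log and your job output actually live on the same partition, compare the filesystems reported by df; the paths below are only examples, so substitute your actual SPOOL directory and job output location:

    # if both paths report the same filesystem, they share a partition
    df /var/lib/condor/spool /home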

Large submit hosts should consider one of the following options:

1. Mount the filesystem holding job_queue.log with data=writeback, so that an fsync of the transaction log does not also force unrelated data (such as job output) to disk.
2. Place job_queue.log (typically in the schedd's SPOOL directory) on a partition separate from job output.

The first option requires the fewest changes to the host, but has some drawbacks. Applications that are not properly designed may lose data after a power loss, because they rely on special ext3 behavior that goes above and beyond POSIX. For ext4, there is also a possibility that delayed-allocation mode can cause garbage data to appear in files after a power cycle.
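
For the first option, a sketch of the corresponding /etc/fstab entry might look like the following; the device name and mount point are placeholders for your actual configuration, and you should review the ext3/ext4 documentation before changing journaling modes:

    # mount the filesystem holding the schedd's spool with data=writeback
    /dev/sda3  /var/lib/condor/spool  ext4  defaults,data=writeback  0  2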

Max Connection Backlog

As of 8.1.5, HTCondor has a default connection backlog of 500; that means up to 500 clients may contact a daemon while it is blocked on other work before they start getting connection errors. Many more than 500 clients may connect to the daemon in total; this limit refers only to the number of connections queued before accept() is called on the TCP socket.

Linux will silently truncate the backlog to 128 connections unless the system-wide limit (net.core.somaxconn, which defaults to 128) is raised. We recommend using:

    echo 500 > /proc/sys/net/core/somaxconn
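
As with the settings above, you'll want to make this persistent across reboots, either in rc.local or, on systems that use /etc/sysctl.conf, with a line such as:

    # increase the maximum TCP listen backlog
    net.core.somaxconn = 500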