Sometimes Linux jobs run, are preempted and can not start again because HTCondor thinks the image size of the job is too big. This is because HTCondor has a problem calculating the image size of a program on Linux that uses threads. It is particularly noticeable in the Java universe, but it also happens in the vanilla universe. It is not an issue in the standard universe, because threaded programs are not allowed.

On Linux, each thread appears to consume as much memory as the entire program consumes, so the image size appears to be (number-of-threads * image-size-of-program). If your program uses a lot of threads, your apparent image size balloons. You can see the image size that HTCondor believes your program has by using the -l option to condor_q, and looking at the ImageSize attribute.

When you submit your job, HTCondor creates or extends the requirements for your job. In particular, it adds a requirement that you job must run on a machine with sufficient memory:

Requirements = ... ((Memory * 1024) >= ImageSize) ...

Note that memory is the execution machine's memory in Mbytes, while ImageSize is in Kbytes. ImageSize is not a perfect measure of the memory requirements of a job. It over-counts memory that is shared between processes. It may appear quite large if the job uses mmap on a large file. It does not account for memory that the job uses indirectly in the operating system's file system cache.

In the Requirements expression above, HTCondor added (Memory * 1024) >= ImageSize on behalf of the job. To prevent HTCondor from doing this, provide your own expression about memory in the submit description file, as in this example:

Requirements = Memory > 1024

You will need to change the value 1024 to a reasonably good estimate of the actual memory requirements of the program, in Mbytes. This example says that the program requires 1 Gbyte of memory. If you underestimate the memory your application needs, you may have bad performance if the job runs on machines that have insufficient memory.

In addition, if you have modified your machine policies to preempt jobs when ImageSize is large, you will need to change those policies.