How to sample the stack

This method is useful when you are called in to diagnose an HTCondor daemon that is using a lot of CPU (or blocking heavily on I/O) for an unknown reason. Although crude, it has led me to the source of trouble in numerous cases.

If pstack or gstack is available, just use that. If not, you can use gdb.

Here's an example gdb script for the schedd (saved as gdb_stack_sampler, the file name used in the command further down):

add-symbol-file /path/to/condor_schedd.dbg
set pagination off
where
quit

Run it a few times with the following command:

gdb -p <pid> < gdb_stack_sampler

If the stack shows that the schedd is usually in a certain part of the code, this may lead you to the source of trouble.

Beware that gdb (and gstack) may sometimes leave the process in a trapped state. Send SIGCONT to the process to let it resume.
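
To check whether a process really was left stopped before sending the signal, one option on Linux is to read its state from /proc. This is only a sketch under that assumption: proc_state and resume_if_stopped are hypothetical helper names, and the "State:" field of /proc/<pid>/status is Linux-specific.

```shell
#!/bin/sh
# Sketch, Linux only: helper names are made up for illustration.

# Print the one-letter state of a process (for example S = sleeping,
# R = running, T = stopped), taken from /proc/<pid>/status.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Send SIGCONT only if the process is currently in the stopped state.
resume_if_stopped() {
    case "$(proc_state "$1")" in
        T*) kill -CONT "$1" ;;
    esac
}

# Usage (with the daemon's pid):
#   resume_if_stopped 12345
```

Sending SIGCONT to a process that is not stopped is harmless, so the check is mostly useful for confirming what happened.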

If you need to get the HTCondor admin to sample the stack for you, then it is nice to make it even simpler. Below is an example script to do the sampling. (Note that it doesn't load the .dbg file, just to keep things simple.)

#!/bin/sh

PID=$1
# use a POSIX test here; "==" is not understood by every /bin/sh
if [ -z "$PID" ]; then
  echo "USAGE: $0 pid"
  exit 2
fi

LOG=stack_sample.$PID
echo "Writing output to $LOG."
max=10
for i in `seq 1 $max`; do
  echo "Sample $i of $max"
  date >> "$LOG"

  # attach, print the stack, and detach again
  gdb -p "$PID" >> "$LOG" <<EOF
   set pagination off
   where
   quit
EOF
  # in case gdb leaves the process trapped
  kill -CONT "$PID"
  sleep 5
done
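
Once a few samples have been collected, a quick tally of which functions show up in the innermost frames can make a hot spot stand out. The following is a rough sketch and not part of HTCondor: top_frames is a hypothetical helper name, and the parsing assumes gdb's usual frame layout ("#N  0xADDR in function (...)" or "#N  function (...)"), so function names that contain spaces (some C++ templates) will be cut short.

```shell
#!/bin/sh
# Sketch: count how often each function appears in the innermost frames
# of saved gdb "where" output.
#   $1: file with the saved output (for example stack_sample.<pid>)
#   $2: how many innermost frames to count per sample (default 3)
top_frames() {
    awk -v depth="${2:-3}" '
        /^#[0-9]+ / {
            n = substr($1, 2) + 0          # frame number without the "#"
            if (n >= depth) next           # ignore the outer frames
            # frames look like "#N 0xADDR in func (...)" or "#N func (...)"
            fn = ($2 ~ /^0x/ && $3 == "in") ? $4 : $2
            count[fn]++
        }
        END { for (f in count) print count[f], f }
    ' "$1" | sort -rn
}

# Usage:
#   top_frames stack_sample.12345 3
```

A function that appears in most of the samples is a good place to start looking.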

How to use callgrind to profile HTCondor

Callgrind is a wonderful profiling tool. Its one big disadvantage is that it slows the application down considerably (about 20 times, in my experience).

Here's an example of how to run callgrind on the collector.

condor_off -collector
# become the user who runs HTCondor
sudo su root
valgrind --tool=callgrind /path/to/condor_collector -f -p 9618 >& /tmp/callgrind.log < /dev/null &
PID=$!

You will see some files in the current working directory named callgrind.out.$PID and callgrind.info.$PID. Change the ownership of these from root to the condor user or callgrind will have trouble writing to them when the program exits.

chown condor:condor callgrind.*.$PID

After running for sufficient time, stop the collector.

kill -TERM $PID

You can then analyze the profile using kcachegrind.

kcachegrind /path/to/callgrind.out.$PID

The call tree is very useful. If you configure kcachegrind to know the path to the code, then you can also see the code for points in the call graph, annotated with profiling information. It is very nice!