How to sample the stack
This method is useful when you are called in to diagnose an HTCondor daemon that is using a lot of CPU (or blocking a lot on I/O) for an unknown reason. Although crude, this method has led me to the source of trouble in numerous cases.
If pstack or gstack are available, just use those. If not, you can use gdb.
Here's an example gdb script for the schedd, saved here as gdb_stack_sampler:
add-symbol-file /path/to/condor_schedd.dbg
set pagination off
where
quit
Run it a few times with the following command:
gdb -p <pid> < gdb_stack_sampler
If the stack shows that the schedd is usually in a certain part of the code, this may lead you to the source of trouble.
Beware that gdb (and gstack) may sometimes leave the process in a trapped state. Send SIGCONT to the process to let it resume.
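To avoid sending SIGCONT blindly, one option is to check the process state first. The following is only a minimal sketch and not part of HTCondor: the function name resume_if_stopped is made up for illustration, and it assumes a ps that accepts -o stat= and reports a stopped process with a state beginning with T (or t for a tracing stop).

```shell
#!/bin/sh
# Hypothetical helper: resume a process only if ps reports it as stopped.
resume_if_stopped() {
    pid=$1
    # "stat=" prints the state column without a header; strip any padding.
    state=$(ps -o stat= -p "$pid" | tr -d ' ')
    [ -n "$state" ] || return 1        # no such process
    case "$state" in
        T*|t*) kill -CONT "$pid" ;;    # stopped, or left in a tracing stop
    esac
}
```

It would be called as resume_if_stopped <pid> after each gdb or gstack run.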
If you need to get the HTCondor admin to sample the stack for you, then it is nice to make it even simpler. Below is an example script to do the sampling. (Note that it doesn't load the .dbg file, just to keep things simple.)
#!/bin/sh
# Sample the stack of a running process several times with gdb.
PID=$1
if [ -z "$PID" ]; then
    echo "USAGE: $0 pid"
    exit 2
fi
LOG=stack_sample.$PID
echo "Writing output to $LOG."
max=10
for i in $(seq 1 "$max"); do
    echo "Sample $i of $max"
    date >> "$LOG"
    gdb -p "$PID" >> "$LOG" <<EOF
set pagination off
where
quit
EOF
    # in case gdb leaves the process trapped
    kill -CONT "$PID"
    sleep 5
done
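Once a stack_sample.$PID log exists, a quick tally of which functions appear most often can point at the hot spot faster than reading every sample. The sketch below is an illustration only: summarize_stacks is a hypothetical name, and it assumes gdb frame lines of the usual form ("#1  0x... in Function (args) at file.cpp:12" or "#0  Function (args) at file.cpp:5"). Function names that contain spaces, such as some templates, will be cut short.

```shell
#!/bin/sh
# Hypothetical helper: count how often each function appears in the
# frame lines of a gdb "where" log, most frequent first.
summarize_stacks() {
    awk '/^#[0-9]+ / {
             # Frames with an address read "#N 0x... in func (...)";
             # frames without one read "#N func (...)".
             fn = ($3 == "in") ? $4 : $2
             count[fn]++
         }
         END { for (fn in count) print count[fn], fn }' "$1" |
        sort -k1,1nr -k2
}
```

For example, summarize_stacks stack_sample.1234 | head would list the functions seen most often across all samples. The counts include every frame, not just the innermost one, so functions near the bottom of the stack (such as main) will always rank high.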
How to use callgrind to profile HTCondor
Callgrind is a wonderful profiling tool. Its one big disadvantage is that it slows down the application considerably (about 20 times in my experience).
Here's an example of how to run callgrind on the collector.
condor_off -collector
# become the user who runs HTCondor
sudo su root
valgrind --tool=callgrind /path/to/condor_collector -f -p 9618 >& /tmp/callgrind.log < /dev/null &
PID=$!
You will see some files in the current working directory named callgrind.out.$PID and callgrind.info.$PID. Change the ownership of these from root to the condor user or callgrind will have trouble writing to them when the program exits.
chown condor:condor callgrind.*.$PID
After running for sufficient time, stop the collector.
kill -TERM $PID
You can then analyze the profile using kcachegrind.
kcachegrind /path/to/callgrind.out.$PID
The call tree is very useful. If you configure kcachegrind to know the path to the code, then you can also see the code for points in the call graph, annotated with profiling information. It is very nice!
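If kcachegrind is not installed on the machine where the profile was taken, the valgrind distribution also ships callgrind_annotate, which prints a text summary of a profile. The following is a rough sketch under the assumption that the profile files are in the current directory; latest_profile and annotate_latest are hypothetical helper names, not HTCondor or valgrind commands.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified callgrind.out.*
# file in the given directory (default: current directory).
latest_profile() {
    ls -t "${1:-.}"/callgrind.out.* 2>/dev/null | head -n 1
}

# Hypothetical helper: print a text summary of the newest profile.
annotate_latest() {
    profile=$(latest_profile "$1")
    if [ -z "$profile" ]; then
        echo "no callgrind.out.* file found" >&2
        return 1
    fi
    callgrind_annotate "$profile"
}
```

Running annotate_latest in the directory where valgrind was started would then print the per-function cost table for the newest profile.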