All platforms
<SUBSYS>_DEBUG_WAIT
When <SUBSYS>_DEBUG_WAIT is true, the daemon pauses on startup in a loop equivalent to bool debug_wait = true; while (debug_wait) { sleep(1); }. That way you can attach to the daemon with a debugger, set debug_wait to false, and continue the run.
(In builds from before late June 2012, this did not work reliably because debug_wait could be optimized out. See #3064 for details and a workaround.)
On Windows you might prefer the slightly more convenient <SUBSYS>_WAIT_FOR_DEBUGGER, documented below.
GDB specific issues
clone(), gdb internal-errors
HTCondor uses clone in a way that causes GDB grief. Running an HTCondor daemon under GDB will likely fail with errors similar to:
warning: Can't attach LWP -1: No such process
../../gdb/linux-thread-db.c:389: internal-error: thread_get_info_callback: Assertion `inout->thread_info != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)
The solution is to tell HTCondor not to use clone, but to fall back on the fork code path. In your HTCondor configuration file, use:
USE_CLONE_TO_CREATE_PROCESSES = false
(This page primarily exists as search bait, so that others hitting this error can quickly find the solution.)
For the adventurous (or those who are debugging the clone calls), one can apply the patch found in #2208.
Debug a core file produced by a stripped binary
Say you have a.out.stripped and a.out.with_symbols. And say you have a core file from a.out.stripped that you want to debug, but are being thwarted by the lack of symbols. The trick is you still want to gdb the file that produced the core, but load symbols from the unstripped binary into the text segment of the file you are debugging.
% gdb a.out.stripped core
... gdb will complain about missing debug info ...
(gdb) maint info sections
In the output from 'maint info sections', look for the .text section. The address in the first column is what you want to use in the next command. For instance, one of the output lines from 'maint info sections' will be:
0x00400390->0x00400588 at 0x00000390: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
The address in the first column, 0x00400390 in this example, is used in the next gdb command:
(gdb) add-symbol-file a.out.with_symbols 0x00400390
Of course, substitute 0x00400390 with the proper text segment address.
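Picking the address out of the output by hand is easy to get wrong, so here is a minimal Python sketch that extracts it. It assumes the 'maint info sections' output has been saved as text in the format shown above; the function name text_section_start is made up for this example and is not part of gdb or HTCondor.

```python
import re

# Matches lines such as:
#   0x00400390->0x00400588 at 0x00000390: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
# search() is used so that any leading text on the line is tolerated.
SECTION_RE = re.compile(
    r"(0x[0-9a-fA-F]+)->(0x[0-9a-fA-F]+) at (0x[0-9a-fA-F]+): (\S+)"
)


def text_section_start(maint_output):
    """Return the start address (first column) of the .text section, or None.

    Hypothetical helper for illustration only.
    """
    for line in maint_output.splitlines():
        match = SECTION_RE.search(line)
        if match and match.group(4) == ".text":
            return int(match.group(1), 16)
    return None


# The example line from the text above.
sample = """\
0x00400390->0x00400588 at 0x00000390: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
"""

start = text_section_start(sample)
# Print the gdb command to paste into the session.
print("add-symbol-file a.out.with_symbols 0x%08x" % start)
```

The sketch only prints the command; you would still paste it into the gdb session yourself.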
That will get your symbols loaded for the main executable, but you still won't have any symbols for libraries that were dynamically loaded, which these days is most of HTCondor.
So, to get those, you need to do a couple of things.
First, run gdb on the unstripped version of the .so you would like to load and find the .text segment just like above. Write down the offset (the first column); you'll need it later.
Second, go back to gdb with the stripped binary and the core. If you look at any particular stack frame, you'll see the instruction pointer in the second column:
#7 0x00007f034f8c1834 in KeyCache::remove ()
Now comes the black-magic portion. Run "maint info sections" again and look at the first column. Find the entry that contains the above instruction pointer, The Price Is Right-style. That is, find the largest value without going over your instruction pointer. It should be marked "ALLOC READONLY CODE".
In the following snippet, if your instruction pointer is 0x00002af760aa32a5, then the address you want is 0x2af76081e000:
0x3fd0424000->0x3fd0424000 at 0x006ce000: load81 ALLOC READONLY
0x3fd0623000->0x3fd0625000 at 0x006ce000: load82 ALLOC LOAD HAS_CONTENTS
0x2af76080e000->0x2af760810000 at 0x006d0000: load83 ALLOC LOAD HAS_CONTENTS
0x2af76081e000->0x2af76081e000 at 0x006d2000: load84 ALLOC READONLY CODE
0x2af760b76000->0x2af760b76000 at 0x006d2000: load85 ALLOC READONLY
0x2af760d75000->0x2af760d88000 at 0x006d2000: load86 ALLOC LOAD HAS_CONTENTS
0x2af760d88000->0x2af760d8e000 at 0x006e5000: load87 ALLOC LOAD HAS_CONTENTS
0x2af760d8e000->0x2af760d8e000 at 0x006eb000: load88 ALLOC READONLY CODE
Now, add that address to the offset of the .text segment from the .so file you found above. You can do this right in gdb:
(gdb) print /x 0x000d58a0 + 0x2af76081e000
$1 = 0x2af7608f38a0
(gdb)
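The lookup and the addition can also be sketched in Python. This is only an illustration, assuming the 'maint info sections' output from the core is available as text in the format shown above; code_section_base and symbol_load_address are hypothetical helper names, not gdb commands.

```python
import re

# Captures the start address, section name, and flags from lines such as:
#   0x2af76081e000->0x2af76081e000 at 0x006d2000: load84 ALLOC READONLY CODE
SECTION_RE = re.compile(
    r"(0x[0-9a-fA-F]+)->0x[0-9a-fA-F]+ at 0x[0-9a-fA-F]+: (\S+)\s*(.*)"
)


def code_section_base(maint_output, instruction_pointer):
    """Largest start address of a CODE section that does not exceed the
    instruction pointer, or None if there is no such section."""
    best = None
    for line in maint_output.splitlines():
        match = SECTION_RE.search(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        flags = match.group(3).split()
        if "CODE" in flags and start <= instruction_pointer:
            if best is None or start > best:
                best = start
    return best


def symbol_load_address(maint_output, instruction_pointer, text_offset):
    """Section base plus the .text offset taken from the unstripped .so."""
    base = code_section_base(maint_output, instruction_pointer)
    if base is None:
        raise ValueError("no CODE section at or below the instruction pointer")
    return base + text_offset


# The example sections and values from the text above.
sample = """\
0x3fd0424000->0x3fd0424000 at 0x006ce000: load81 ALLOC READONLY
0x3fd0623000->0x3fd0625000 at 0x006ce000: load82 ALLOC LOAD HAS_CONTENTS
0x2af76080e000->0x2af760810000 at 0x006d0000: load83 ALLOC LOAD HAS_CONTENTS
0x2af76081e000->0x2af76081e000 at 0x006d2000: load84 ALLOC READONLY CODE
0x2af760b76000->0x2af760b76000 at 0x006d2000: load85 ALLOC READONLY
0x2af760d75000->0x2af760d88000 at 0x006d2000: load86 ALLOC LOAD HAS_CONTENTS
0x2af760d88000->0x2af760d8e000 at 0x006e5000: load87 ALLOC LOAD HAS_CONTENTS
0x2af760d8e000->0x2af760d8e000 at 0x006eb000: load88 ALLOC READONLY CODE
"""

address = symbol_load_address(sample, 0x00002AF760AA32A5, 0x000D58A0)
print("add-symbol-file unstripped.so 0x%x" % address)
# prints: add-symbol-file unstripped.so 0x2af7608f38a0
```

Filtering on the CODE flag before comparing addresses mirrors the check that the chosen entry is marked "ALLOC READONLY CODE".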
Finally, you can add the symbols from the shared library with
add-symbol-file unstripped.so 0xMagicAddress
You'll have to do this for each shared library (libcondor_util, libclassad, etc.) if you need the symbols for that particular stack frame.
Note that the HTCondor binaries are linked separately for the tarball and native packages (rpm and deb), with slightly different settings. For example, the RUNPATH is set differently. Thus, the binaries from the unstripped tarball may not work properly with a core file generated from an RPM installation.
Windows specific issues
<SUBSYS>_WAIT_FOR_DEBUGGER
In addition to <SUBSYS>_DEBUG_WAIT, on Windows (#ifdef WIN32) you have access to <SUBSYS>_WAIT_FOR_DEBUGGER. It's very similar, but is smart enough to automatically detect when a debugger is attached.
Using the Windows heap verifier
The Windows SDK includes a set of Debugging Tools for Windows, one of which is the heap verifier. It can detect heap corruption early and provide a stack trace at or near the corrupting code. To use it, you will need to install the download at http://www.microsoft.com/en-us/download/confirmation.aspx?id=8279 to get gflags and cdb/ntsd.
The steps are:
- Use gflags to turn on heap checking for the daemon.
- Replace the daemon with a script that runs the daemon under the debugger.
- Start HTCondor. When the daemon starts up, attach to the debugger and enter 'g' to let the daemon begin running.
You can replace the daemon with a batch script like this one, which replaces the condor_starter. Save this as c:\condor\bin\debug_condor_starter.bat:
@echo off
@setlocal
@if NOT "%1"=="-classad" goto :job
@REM just run the starter normally
%~dp0condor_starter.exe %*
@goto :EOF
:job
@REM echo %* > c:\condor\log\starterargs
set remote=-server tcp:port=5099:6000
set log=-logo c:\condor\log\DStarterLog
::set nobreak=-g
set _NT_SYMBOL_PATH=c:\condor\bin;SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
"C:\Program Files\Debugging Tools for Windows (x64)\cdb.exe" %remote% %log% -x %nobreak% %~dp0condor_starter.exe %*
Then add this to your configuration:
STARTER = $(BIN)\debug_condor_starter.bat
Start a command prompt with run-as-administrator and execute:
"C:\Program Files\Debugging Tools for Windows (x64)\gflags" /p /enable condor_starter.exe
"C:\Program Files\Debugging Tools for Windows (x64)\gflags" /p /enable condor_starter.exe /full
Start HTCondor. Once the daemon starts up, run this command to connect to the debugger:
"C:\Program Files\Debugging Tools for Windows (x64)\cdb" -remote tcp:port=5099:6000
Unless nobreak is set in the script above, the debugger will have broken in and the daemon will not yet have started. Run the following commands to start the daemon:
.lines
g