# Debugging Garbage Collector Related Problems

This page contains some hints on debugging issues specific to the
Boehm-Demers-Weiser conservative garbage collector. It applies both
to debugging issues in client code that manifest themselves as collector
misbehavior, and to debugging the collector itself.

If you suspect a bug in the collector itself, it is strongly recommended that
you try the latest collector release before proceeding.

## Bus Errors and Segmentation Violations

If the fault occurred in `GC_find_limit`, or with incremental collection
enabled, this is probably normal. The collector installs handlers to take care
of these faults. You will not see them unless you are using a debugger. Your
debugger _should_ allow you to continue. It's often preferable to tell the
debugger to ignore SIGBUS and SIGSEGV (`handle SIGSEGV SIGBUS nostop noprint`
in gdb, `ignore SIGSEGV SIGBUS` in most versions of dbx) and set a breakpoint
in `abort`. The collector will call `abort` if the signal had another cause
and no other handler was previously installed.

We recommend debugging without incremental collection if possible. (This
applies directly to UNIX systems. Debugging with incremental collection under
win32 is worse. See README.win32.)

If the application generates an unhandled SIGSEGV or equivalent, it may often
be easiest to set the environment variable `GC_LOOP_ON_ABORT`. On many
platforms, this will cause the collector to loop in a handler when the SIGSEGV
is encountered (or when the collector aborts for some other reason), and
a debugger can then be attached to the looping process. This sidesteps common
operating system problems related to incomplete core files for multi-threaded
applications, etc.

## Other Signals

On most platforms, the multi-threaded version of the collector needs one or
two other signals for internal use by the collector in stopping threads.
It is normally wise to tell the debugger to ignore these. On Linux, the
collector currently uses SIGPWR and SIGXCPU by default.

## Warning Messages About Needing to Allocate Blacklisted Blocks

The garbage collector generates warning messages of the form:

    Needed to allocate blacklisted block at 0x...

or

    Repeated allocation of very large block ...

when it needs to allocate a block at a location that it knows to be referenced
by a false pointer. These false pointers can be either permanent (e.g.
a static integer variable that never changes) or temporary. In the latter
case, the warning is largely spurious, and the block will eventually
be reclaimed normally. In the former case, the program will still run
correctly, but the block will never be reclaimed. Unless the block is intended
to be permanent, the warning indicates a memory leak.

To address these warnings:

  1. Ignore these warnings while you are using `GC_DEBUG`. Some of the routines
     mentioned below don't have debugging equivalents. (Alternatively, write the
     missing routines and send them to me.)
  2. Replace allocator calls that request large blocks with calls to
     `GC_malloc_ignore_off_page` or `GC_malloc_atomic_ignore_off_page`. You may
     want to set a breakpoint in `GC_default_warn_proc` to help you identify such
     calls. Make sure that a pointer to somewhere near the beginning of the
     resulting block is maintained in a (preferably volatile) variable as long
     as the block is needed.
  3. If the large blocks are allocated with realloc, we suggest instead
     allocating them with something like the following. Note that the realloc
     size increment should be fairly large (e.g. a factor of 3/2) for this to
     exhibit reasonable performance. But we all know we should do that anyway.

         #include <string.h> /* memcpy */
         #include "gc.h"

         /* realloc replacement for large blocks; small ones use GC_realloc. */
         void *big_realloc(void *p, size_t new_size) {
             /* A NULL p behaves like malloc: there is nothing to copy. */
             size_t old_size = p != NULL ? GC_size(p) : 0;
             void *result;

             if (new_size <= 10000) return GC_realloc(p, new_size);
             if (new_size <= old_size) return p; /* already large enough */
             result = GC_malloc_ignore_off_page(new_size);
             if (result == NULL) return NULL;
             if (p != NULL) {
                 memcpy(result, p, old_size); /* old_size < new_size here */
                 GC_free(p);
             }
             return result;
         }

  4. In the unlikely case that even relatively small object (<20KB)
     allocations are triggering these warnings, then your address space contains
     lots of "bogus pointers", i.e. values that appear to be pointers but aren't.
     Usually this can be solved by using `GC_malloc_atomic` or the routines
     in `gc_typed.h` to allocate large pointer-free regions of bitmaps, etc.
     Sometimes the problem can be solved with trivial changes of encoding
     in certain values. It is possible to identify the source of the bogus
     pointers by building the collector with `-DPRINT_BLACK_LIST`, which will
     cause it to print the "bogus pointers", along with their location.
  5. If you get only a fixed number of these warnings, you are probably only
     introducing a bounded leak by ignoring them. If the data structures being
     allocated are intended to be permanent, then it is also safe to ignore them.
     The warnings can be turned off by calling `GC_set_warn_proc` with
     a procedure that ignores these warnings (e.g. by doing absolutely nothing).

## The Collector References a Bad Address in GC_malloc

This typically happens while the collector is trying to remove an entry from
its free list, and the free list pointer is bad because the free list link
in the last allocated object was bad.

With >99% probability, you wrote past the end of an allocated object. Try
defining `GC_DEBUG` before including `gc.h` and allocating with `GC_MALLOC`.
This will try to detect such overwrite errors.
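To make this failure mode concrete, here is a toy model of a free list whose links live inside the free objects themselves. It is a sketch for illustration only, not the collector's allocator: `toy_pool`, `toy_alloc`, `toy_client_write` and the slot sizes are all hypothetical names and values, and offsets stand in for pointers so that the overrun stays well defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SLOT_SIZE = 16, SLOT_COUNT = 4 };
#define TOY_END ((size_t)-1) /* end-of-list marker */

/* Hypothetical pool: each free slot stores the offset of the next free slot
 * in its first bytes, so the links share memory with client data. */
typedef struct {
    unsigned char bytes[SLOT_SIZE * SLOT_COUNT];
    size_t free_head; /* offset of the first free slot, or TOY_END */
} toy_pool;

/* Thread all slots onto the free list in address order. */
static void toy_init(toy_pool *p) {
    for (size_t i = 0; i < SLOT_COUNT; i++) {
        size_t next = (i + 1 < SLOT_COUNT) ? (i + 1) * SLOT_SIZE : TOY_END;
        memcpy(p->bytes + i * SLOT_SIZE, &next, sizeof next);
    }
    p->free_head = 0;
}

/* A valid slot offset is slot-aligned and lies inside the pool. */
static int toy_link_ok(size_t off) {
    return off % SLOT_SIZE == 0 && off < sizeof ((toy_pool *)0)->bytes;
}

/* Pop the first free slot. The new head is taken on trust from the link
 * stored in that slot; a bad head is only noticed on the NEXT call, which
 * sets *corrupt where a real allocator would reference a bad address. */
static size_t toy_alloc(toy_pool *p, int *corrupt) {
    size_t slot = p->free_head;
    *corrupt = 0;
    if (slot == TOY_END) return TOY_END; /* pool exhausted */
    if (!toy_link_ok(slot)) {
        *corrupt = 1;
        return TOY_END;
    }
    memcpy(&p->free_head, p->bytes + slot, sizeof p->free_head);
    return slot;
}

/* Simulate client code writing len bytes into a slot; len > SLOT_SIZE
 * overruns into the following slot and may clobber its free-list link. */
static void toy_client_write(toy_pool *p, size_t slot, size_t len) {
    assert(slot + len <= sizeof p->bytes); /* stay inside the pool array */
    memset(p->bytes + slot, 0xAB, len);
}
```

In this model, allocating one slot, calling `toy_client_write` on it with `SLOT_SIZE + 8` bytes, and then allocating twice more sets `*corrupt` on the last call: the second allocation hands out the damaged slot and installs its clobbered link as the new list head. The fault therefore shows up away from the write that caused it, which is why the debugging allocator is the practical way to find the overrun.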

## Unexpectedly Large Heap

Unexpected heap growth can be due to one of the following:

  1. Data structures that are being unintentionally retained. This is commonly
     caused by data structures that are no longer being used, but were not
     cleared, or by caches growing without bounds.
  2. Pointer misidentification. The garbage collector is interpreting integers
     or other data as pointers and retaining the "referenced" objects. A common
     symptom is that `GC_dump()` shows much of the heap as black-listed.
  3. Heap fragmentation. This should never result in unbounded growth, but
     it may account for larger heaps. This is most commonly caused by allocation
     of large objects.
  4. Per-object overhead. This is usually a relatively minor effect, but
     it may be worth considering. If the collector recognizes interior pointers,
     object sizes are increased, so that one-past-the-end pointers are correctly
     recognized. The collector can be configured not to do this
     (`-DDONT_ADD_BYTE_AT_END`).

The collector rounds up object sizes so the result fits well into the chunk
size (`HBLKSIZE`, normally 4K on 32-bit machines, 8K on 64-bit machines) used
by the collector. Thus it may be worth avoiding objects of size 2K + 1 (or 2K
if a byte is being added at the end). The last two cases can often
be identified by looking at the output of a call to `GC_dump`. Among other
things, it will print the list of free heap blocks, and a very brief
description of all chunks in the heap, the object sizes they correspond to,
and how many live objects were found in the chunk at the last collection.

Growing data structures can usually be identified by:

  1. Building the collector with `-DKEEP_BACK_PTRS`,
  2. Preferably using debugging allocation (defining `GC_DEBUG` before
     including `gc.h` and allocating with `GC_MALLOC`), so that objects will
     be identified by their allocation site,
  3. Running the application long enough so that most of the heap is composed
     of "leaked" memory, and
  4. Then calling `GC_generate_random_backtrace` from `gc_backptr.h` a few
     times to determine why some randomly sampled objects in the heap are being
     retained.

The same technique can often be used to identify problems with false pointers,
by noting whether the reference chains printed
by `GC_generate_random_backtrace` involve any misidentified pointers.
An alternate technique is to build the collector with `-DPRINT_BLACK_LIST`,
which will cause it to report values that almost, but not quite, look like
heap pointers. It is very likely that actual false pointers will come from
similar sources.

In the unlikely case that false pointers are an issue, it can usually
be resolved using one or more of the following techniques:

  1. Use `GC_malloc_atomic` for objects containing no pointers. This is
     especially important for large arrays containing compressed data,
     pseudo-random numbers, and the like. It is also likely to improve GC
     performance, perhaps drastically so if the application is paging.
  2. If you allocate large objects containing only one or two pointers at the
     beginning, either try the typed allocation primitives in `gc_typed.h`,
     or separate out the pointer-free component.
  3. Consider using `GC_malloc_ignore_off_page` to allocate large objects.
     (See `gc.h` and above for details. Large means >100K in most environments.)
  4. If your heap size is larger than 100MB or so, build the collector with
     `-DLARGE_CONFIG`. This allows the collector to keep more precise black-list
     information.
  5. If you are using heaps close to, or larger than, a gigabyte on a 32-bit
     machine, you may want to consider moving to a platform with 64-bit pointers.
     This is very likely to resolve any false pointer issues.
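The effect of heap size and pointer width on false pointers can be illustrated with a deliberately crude model of a conservative scan. This is a sketch under stated assumptions, not the collector's marking code: `toy_heap_range` and both functions are hypothetical, addresses are modeled as plain `uint64_t` values, and the only test applied is "does this word fall inside the heap range". The real collector also checks that the address corresponds to an allocated block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [lo, hi) standing in for the collected heap. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} toy_heap_range;

/* In this model, any word whose value lies inside the heap range is
 * treated as a pointer, whatever the program meant it to be. */
static int toy_looks_like_pointer(uint64_t word, toy_heap_range heap) {
    return word >= heap.lo && word < heap.hi;
}

/* Count the words of a scanned region that would be taken for pointers
 * and could therefore keep heap objects alive. */
static size_t toy_count_hits(const uint64_t *words, size_t n,
                             toy_heap_range heap) {
    size_t hits = 0;
    assert(words != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (toy_looks_like_pointer(words[i], heap)) hits++;
    }
    return hits;
}
```

With a 1 GiB heap occupying `[0x40000000, 0x80000000)` in a 32-bit address space, a quarter of all possible 32-bit values count as hits, so arrays of compressed data or pseudo-random numbers are likely to pin objects. Place a heap of the same size high in a 64-bit address space and the same data is very unlikely to match. Data that the collector never scans, such as objects from `GC_malloc_atomic`, cannot contribute hits at all.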

## Prematurely Reclaimed Objects

The usual symptom of this is a segmentation fault, or an obviously overwritten
value in a heap object. This should, of course, be impossible. In practice,
it may happen for reasons like the following:

  1. The collector did not intercept the creation of threads correctly
     in a multi-threaded application, e.g. because the client called
     `pthread_create` without including `gc.h`, which redefines it.
  2. The last pointer to an object in the garbage collected heap was stored
     somewhere where the collector could not see it, e.g. in an object allocated
     with system `malloc`, in certain types of `mmap`ed files, or in some data
     structure visible only to the OS. (On some platforms, thread-local storage
     is one of these.)
  3. The last pointer to an object was somehow disguised, e.g. by XORing
     it with another pointer.
  4. Incorrect use of `GC_malloc_atomic` or typed allocation.
  5. An incorrect `GC_free` call.
  6. The client program overwrote an internal garbage collector data
     structure.
  7. A garbage collector bug.
  8. (Empirically less likely than any of the above.) A compiler optimization
     that disguised the last pointer.

The following relatively simple techniques should be tried first to narrow
down the problem:

  1. If you are using the incremental collector, try turning it off for
     debugging.
  2. If you are using shared libraries, try linking statically. If that works,
     ensure that `DYNAMIC_LOADING` is defined on your platform.
  3. Try to reproduce the problem with fully debuggable unoptimized code. This
     will eliminate the last possibility, as well as making debugging easier.
  4. Try replacing any suspect typed allocation and `GC_malloc_atomic` calls
     with calls to `GC_malloc`.
  5. Try removing any `GC_free` calls (e.g. with a suitable `#define`).
  6. Rebuild the collector with `-DGC_ASSERTIONS`.
  7. If the following works on your platform (i.e. if `gctest` still works if
     you do this), try building the collector with
     `-DREDIRECT_MALLOC=GC_malloc_uncollectable`. This will cause the collector
     to scan memory allocated with malloc.

If all else fails, you will have to attack this with a debugger. The suggested
steps are:

  1. Call `GC_dump` from the debugger around the time of the failure. Verify
     that the collector's idea of the root set (i.e. static data regions which
     it should scan for pointers) looks plausible. If not, i.e. if it does not
     include some static variables, report this as a collector bug. Be sure
     to describe your platform precisely, since this sort of problem is nearly
     always very platform dependent.
  2. Especially if the failure is not deterministic, try to isolate
     it to a relatively small test case.
  3. Set a breakpoint in `GC_finish_collection`. This is a good point
     to examine what has been marked, i.e. found reachable, by the collector.
  4. If the failure is deterministic, run the process up to the last
     collection before the failure. Note that the variable `GC_gc_no` counts
     collections and can be used to set a conditional breakpoint in the right
     one. It is incremented just before the call to `GC_finish_collection`.
     If object `p` was prematurely recycled, it may be helpful to look
     at `*GC_find_header(p)` at the failure point. The `hb_last_reclaimed` field
     will identify the collection number during which its block was last swept.
  5. Verify that the offending object still has its correct contents at this
     point. Then call `GC_is_marked(p)` from the debugger to verify that the
     object has not been marked, and is about to be reclaimed. Note that
     `GC_is_marked(p)` expects the real address of an object (the address of the
     debug header if there is one), and thus it may be more appropriate to call
     `GC_is_marked(GC_base(p))` instead.
  6. Determine a path from a root, i.e. static variable, stack, or register
     variable, to the reclaimed object. Call `GC_is_marked(q)` for each object
     `q` along the path, trying to locate the first unmarked object, say `r`.
  7. If `r` is pointed to by a static root, verify that the location pointing
     to it is part of the root set printed by `GC_dump`. If it is on the stack
     in the main (or only) thread, verify that `GC_stackbottom` is set correctly
     to the base of the stack. If it is in another thread stack, check the
     collector's thread data structure (`GC_threads[]` on several platforms)
     to make sure that stack bounds are set correctly.
  8. If `r` is pointed to by heap object `s`, check that the collector's
     layout description for `s` is such that the pointer field will be scanned.
     Print `*GC_find_header(s)` to look at the descriptor for the heap chunk.
     The `hb_descr` field specifies the layout of objects in that chunk.
     See `gc_mark.h` for the meaning of the descriptor. (If its low order 2 bits
     are zero, then it is just the length of the object prefix to be scanned.
     This form is always used for objects allocated with `GC_malloc` or
     `GC_malloc_atomic`.)
  9. If the failure is not deterministic, you may still be able to apply some
     of the above techniques at the point of failure. But remember that objects
     allocated since the last collection will not have been marked, even if the
     collector is functioning properly. On some platforms, the collector can
     be configured to save call chains in objects for debugging. Enabling this
     feature will also cause it to save the call stack at the point of the last
     GC in `GC_arrays._last_stack`.
  10. When looking at GC internal data structures, remember that a number
      of `GC_xxx` variables are really macros defined as `GC_arrays._xxx`, so
      that the collector can avoid scanning them.
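The descriptor check in step 8 can be written down as a small helper. This is a sketch that encodes only the rule quoted above (low order 2 bits zero means the descriptor is the length of the prefix to be scanned). `TOY_DS_TAG_MASK` and `toy_field_is_scanned` are hypothetical names; the real tag constants and the other descriptor forms are defined in `gc_mark.h`. Treating a field as scanned only when it lies entirely inside the prefix is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the low order 2 bits of a descriptor are its tag, and a zero
 * tag means "length form". See gc_mark.h for the authoritative definitions. */
#define TOY_DS_TAG_MASK ((uintptr_t)3)

/* Decide whether a pointer field at field_offset (in bytes from the start
 * of the object) is covered by the descriptor descr.
 * Returns 1 if a length-form descriptor covers the whole field,
 *         0 if a length-form descriptor does not cover it,
 *        -1 if descr is not in length form, so this helper cannot tell. */
static int toy_field_is_scanned(uintptr_t descr, size_t field_offset,
                                size_t field_size) {
    assert(field_size > 0);
    if ((descr & TOY_DS_TAG_MASK) != 0) return -1; /* bitmap, proc, etc. */
    if ((uintptr_t)field_offset > descr) return 0; /* starts past the prefix */
    /* The field must end at or before the end of the scanned prefix. */
    return (uintptr_t)field_size <= descr - (uintptr_t)field_offset;
}
```

In a debugging session one would read `hb_descr` from `*GC_find_header(s)` and pass it in together with the offset of the suspect pointer field within `s`. A result of 0 for a field that really holds a pointer suggests the object was allocated with the wrong primitive, for example `GC_malloc_atomic`, where presumably nothing is scanned.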