# Debugging Garbage Collector Related Problems

This page contains some hints on debugging issues specific to the
Boehm-Demers-Weiser conservative garbage collector. It applies both
to debugging issues in client code that manifest themselves as collector
misbehavior, and to debugging the collector itself.

If you suspect a bug in the collector itself, it is strongly recommended that
you try the latest collector release before proceeding.

## Bus Errors and Segmentation Violations

If the fault occurred in `GC_find_limit`, or with incremental collection
enabled, this is probably normal. The collector installs handlers to take care
of these faults. You will not see them unless you are using a debugger. Your
debugger _should_ allow you to continue. It's often preferable to tell the
debugger to ignore SIGBUS and SIGSEGV (`handle SIGSEGV SIGBUS nostop noprint`
in gdb, `ignore SIGSEGV SIGBUS` in most versions of dbx) and set a breakpoint
in `abort`. The collector will call `abort` if the signal had another cause and
there was no other handler previously installed.

We recommend debugging without incremental collection if possible. (This
applies directly to UNIX systems. Debugging with incremental collection under
win32 is worse. See README.win32.)

If the application generates an unhandled SIGSEGV or equivalent, it may often
be easiest to set the environment variable `GC_LOOP_ON_ABORT`. On many
platforms, this will cause the collector to loop in a handler when the SIGSEGV
is encountered (or when the collector aborts for some other reason), and
a debugger can then be attached to the looping process. This sidesteps common
operating system problems related to incomplete core files for multi-threaded
applications, etc.

## Other Signals

On most platforms, the multi-threaded version of the collector needs one or
two other signals for its own use in stopping threads. It is normally wise
to tell the debugger to ignore these. On Linux, the collector
currently uses SIGPWR and SIGXCPU by default.

## Warning Messages About Needing to Allocate Blacklisted Blocks

The garbage collector generates warning messages of the form:

    Needed to allocate blacklisted block at 0x...

or

    Repeated allocation of very large block ...

when it needs to allocate a block at a location that it knows to be referenced
by a false pointer. These false pointers can be either permanent (e.g.
a static integer variable that never changes) or temporary. In the latter
case, the warning is largely spurious, and the block will eventually
be reclaimed normally. In the former case, the program will still run
correctly, but the block will never be reclaimed. Unless the block is intended
to be permanent, the warning indicates a memory leak.

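To make the notion of a false pointer concrete, here is a minimal sketch of
the kind of range test a conservative collector has to rely on. It is
a simplified model, not the collector's actual code: `struct heap_range` and
`looks_like_heap_pointer` are hypothetical names, and the real collector
consults its own heap data structures rather than a single address range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of the heap as one contiguous address range. */
struct heap_range {
    uintptr_t start;  /* first heap address */
    uintptr_t end;    /* one past the last heap address */
};

/* Return 1 if an arbitrary machine word could be mistaken for a heap
   pointer under this simplified model, 0 otherwise. */
int looks_like_heap_pointer(uintptr_t word, struct heap_range heap) {
    assert(heap.start <= heap.end);
    return word >= heap.start && word < heap.end;
}
```

Under this model, a static integer that happens to hold a value inside the
heap range cannot be told apart from a real reference.
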
The following steps can help to eliminate or interpret these warnings:

  1. Ignore these warnings while you are using `GC_DEBUG`. Some of the routines
  mentioned below don't have debugging equivalents. (Alternatively, write the
  missing routines and send them to me.)
  2. Replace allocator calls that request large blocks with calls to
  `GC_malloc_ignore_off_page` or `GC_malloc_atomic_ignore_off_page`. You may
  want to set a breakpoint in `GC_default_warn_proc` to help you identify such
  calls. Make sure that a pointer to somewhere near the beginning of the
  resulting block is maintained in a (preferably volatile) variable as long
  as the block is needed.
  3. If the large blocks are allocated with `realloc`, we suggest instead
  allocating them with something like the following. Note that the realloc
  size increment should be fairly large (e.g. a factor of 3/2) for this to
  exhibit reasonable performance. But we all know we should do that anyway.

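        #include <string.h>
        #include "gc.h"

        /* Grow a large block without GC_realloc; 10000 is an arbitrary cutoff. */
        void *big_realloc(void *p, size_t new_size) {
            size_t old_size;
            void *result;

            if (new_size <= 10000) return GC_realloc(p, new_size);
            /* Avoid calling GC_size on NULL; just allocate a fresh block. */
            if (p == NULL) return GC_malloc_ignore_off_page(new_size);
            old_size = GC_size(p);
            if (new_size <= old_size) return p;  /* block is already big enough */
            result = GC_malloc_ignore_off_page(new_size);
            if (result == NULL) return NULL;
            memcpy(result, p, old_size);  /* old_size < new_size here */
            GC_free(p);
            return result;
        }
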
  4. In the unlikely case that even relatively small object (<20KB)
  allocations are triggering these warnings, your address space contains
  lots of "bogus pointers", i.e. values that appear to be pointers but aren't.
  Usually this can be solved by using `GC_malloc_atomic` or the routines
  in `gc_typed.h` to allocate large pointer-free regions of bitmaps, etc.
  Sometimes the problem can be solved with trivial changes of encoding
  in certain values. It is possible to identify the source of the bogus
  pointers by building the collector with `-DPRINT_BLACK_LIST`, which will
  cause it to print the "bogus pointers", along with their location.
  5. If you get only a fixed number of these warnings, you are probably only
  introducing a bounded leak by ignoring them. If the data structures being
  allocated are intended to be permanent, then it is also safe to ignore them.
  The warnings can be turned off by calling `GC_set_warn_proc` with
  a procedure that ignores these warnings (e.g. by doing absolutely nothing).

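The growth policy mentioned in item 3 can be sketched as follows.
`next_capacity` is a hypothetical helper, not part of the collector's API. It
assumes that the caller tracks the current capacity itself and grows it by
a factor of 3/2 whenever a request exceeds it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the capacity to request next: at least `needed`, and at least
   3/2 of the current capacity, so that repeated appends stay cheap. */
size_t next_capacity(size_t current, size_t needed) {
    size_t grown;

    assert(current <= SIZE_MAX - current / 2);  /* growth must not overflow */
    grown = current + current / 2;
    return grown > needed ? grown : needed;
}
```

A caller would pass the result to a routine such as `big_realloc` above, so
that a long sequence of appends triggers only a small number of large
reallocations.
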
## The Collector References a Bad Address in GC_malloc

This typically happens while the collector is trying to remove an entry from
its free list, and the free list pointer is bad because the free list link
in the last allocated object was bad.

With >99% probability, you wrote past the end of an allocated object. Try
defining `GC_DEBUG` before including `gc.h` and allocating with `GC_MALLOC`.
This will try to detect such overwrite errors.

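The idea behind this kind of overwrite detection can be sketched with plain
`malloc`. This is only a simplified model with hypothetical names
(`guarded_alloc`, `guard_intact`, `GUARD_WORD`), not the collector's debugging
allocator: a known marker is stored just past the object and checked later.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_WORD 0xDEADBEEFu  /* arbitrary marker chosen for this sketch */

/* Allocate n bytes followed by a guard word; returns NULL on failure. */
void *guarded_alloc(size_t n) {
    unsigned guard = GUARD_WORD;
    unsigned char *p = malloc(n + sizeof guard);

    if (p == NULL) return NULL;
    memcpy(p + n, &guard, sizeof guard);
    return p;
}

/* Return 1 if the guard word after the n-byte object is still intact. */
int guard_intact(const void *obj, size_t n) {
    unsigned guard;

    assert(obj != NULL);
    memcpy(&guard, (const unsigned char *)obj + n, sizeof guard);
    return guard == GUARD_WORD;
}
```

A write one byte past the end of the object lands in the guard word, so
a later `guard_intact` check reports the overwrite.
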
## Unexpectedly Large Heap

Unexpected heap growth can be due to one of the following:

  1. Data structures that are being unintentionally retained. This is commonly
  caused by data structures that are no longer being used, but were not
  cleared, or by caches growing without bounds.
  2. Pointer misidentification. The garbage collector is interpreting integers
  or other data as pointers and retaining the "referenced" objects. A common
  symptom is that `GC_dump()` shows much of the heap as black-listed.
  3. Heap fragmentation. This should never result in unbounded growth, but
  it may account for larger heaps. This is most commonly caused by allocation
  of large objects.
  4. Per-object overhead. This is usually a relatively minor effect, but
  it may be worth considering. If the collector recognizes interior pointers,
  object sizes are increased, so that one-past-the-end pointers are correctly
  recognized. The collector can be configured not to do this
  (`-DDONT_ADD_BYTE_AT_END`).

The collector rounds up object sizes so the result fits well into the chunk
size (`HBLKSIZE`, normally 4K on 32-bit machines, 8K on 64-bit machines) used
by the collector. Thus it may be worth avoiding objects of size 2K + 1 (or 2K
if a byte is being added at the end). The last two causes listed above (heap
fragmentation and per-object overhead) can often be identified by looking
at the output of a call to `GC_dump`. Among other
things, it will print the list of free heap blocks, and a very brief
description of all chunks in the heap, the object sizes they correspond to,
and how many live objects were found in the chunk at the last collection.

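The effect of size rounding can be illustrated with a simplified model that
only divides the chunk size by the object size. It ignores any finer-grained
rounding and the extra byte mentioned above, so the numbers are indicative
only. `objects_per_block` and `wasted_bytes` are hypothetical helpers, and
`BLOCK_SIZE` merely stands in for `HBLKSIZE`.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 4096  /* assumed chunk size for this sketch */

/* Number of objects of the given size that fit into one chunk. */
size_t objects_per_block(size_t obj_size) {
    assert(obj_size > 0 && obj_size <= BLOCK_SIZE);
    return BLOCK_SIZE / obj_size;
}

/* Bytes of the chunk left unused by objects of the given size. */
size_t wasted_bytes(size_t obj_size) {
    return BLOCK_SIZE - objects_per_block(obj_size) * obj_size;
}
```

In this model, two 2048-byte objects fill a 4096-byte chunk exactly, while
a 2049-byte object leaves 2047 bytes of its chunk unused.
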
Growing data structures can usually be identified by:

  1. Building the collector with `-DKEEP_BACK_PTRS`,
  2. Preferably using debugging allocation (defining `GC_DEBUG` before
  including `gc.h` and allocating with `GC_MALLOC`), so that objects will
  be identified by their allocation site,
  3. Running the application long enough so that most of the heap is composed
  of "leaked" memory, and
  4. Then calling `GC_generate_random_backtrace` from `gc_backptr.h` a few
  times to determine why some randomly sampled objects in the heap are being
  retained.

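Once such a structure has been found, the usual remedy for unintentional
retention is to clear references that are no longer needed. The following
is a minimal sketch of that idea, using a hypothetical fixed-size pointer
stack (`struct ptr_stack`, `stack_push`, `stack_pop`) that is not part of the
collector.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_CAPACITY 16  /* arbitrary size for this sketch */

struct ptr_stack {
    void *items[STACK_CAPACITY];
    size_t count;
};

/* Push an item; returns 0 if the stack is full, 1 otherwise. */
int stack_push(struct ptr_stack *s, void *item) {
    if (s->count == STACK_CAPACITY) return 0;
    s->items[s->count++] = item;
    return 1;
}

/* Pop the top item and clear the vacated slot, so that a stale pointer
   left behind in the array cannot keep the object reachable. */
void *stack_pop(struct ptr_stack *s) {
    void *item;

    assert(s->count > 0);
    item = s->items[--s->count];
    s->items[s->count] = NULL;
    return item;
}
```
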
The same backtrace technique can often be used to identify problems with
false pointers, by noting whether the reference chains printed
by `GC_generate_random_backtrace` involve any misidentified pointers.
An alternate technique is to build the collector with `-DPRINT_BLACK_LIST`,
which will cause it to report values that almost, but not quite, look like
heap pointers. It is very likely that actual false pointers will come from
similar sources.

In the unlikely case that false pointers are an issue, it can usually
be resolved using one or more of the following techniques:

  1. Use `GC_malloc_atomic` for objects containing no pointers. This is
  especially important for large arrays containing compressed data,
  pseudo-random numbers, and the like. It is also likely to improve GC
  performance, perhaps drastically so if the application is paging.
  2. If you allocate large objects containing only one or two pointers at the
  beginning, either try the typed allocation primitives in `gc_typed.h`,
  or separate out the pointer-free component.
  3. Consider using `GC_malloc_ignore_off_page` to allocate large objects.
  (See `gc.h` and above for details. Large means >100K in most environments.)
  4. If your heap size is larger than 100MB or so, build the collector with
  `-DLARGE_CONFIG`. This allows the collector to keep more precise black-list
  information.
  5. If you are using heaps close to, or larger than, a gigabyte on a 32-bit
  machine, you may want to consider moving to a platform with 64-bit pointers.
  This is very likely to resolve any false pointer issues.

## Prematurely Reclaimed Objects

The usual symptom of this is a segmentation fault, or an obviously overwritten
value in a heap object. This should, of course, be impossible. In practice,
it may happen for reasons like the following:

  1. The collector did not intercept the creation of threads correctly
  in a multi-threaded application, e.g. because the client called
  `pthread_create` without including `gc.h`, which redefines it.
  2. The last pointer to an object in the garbage collected heap was stored
  somewhere where the collector could not see it, e.g. in an object allocated
  with system `malloc`, in certain types of `mmap`ed files, or in some data
  structure visible only to the OS. (On some platforms, thread-local storage
  is one of these.)
  3. The last pointer to an object was somehow disguised, e.g. by XORing
  it with another pointer.
  4. Incorrect use of `GC_malloc_atomic` or typed allocation.
  5. An incorrect `GC_free` call.
  6. The client program overwrote an internal garbage collector data
  structure.
  7. A garbage collector bug.
  8. (Empirically less likely than any of the above.) A compiler optimization
  that disguised the last pointer.

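The pointer disguising described in item 3 can be illustrated without the
collector at all. In this sketch, `disguise` and `reveal` are hypothetical
helpers and the key is an arbitrary value; the point is only that the stored
word no longer looks like the object's address.

```c
#include <assert.h>
#include <stdint.h>

/* Hide a pointer by XORing its address with a nonzero key. */
uintptr_t disguise(const void *p, uintptr_t key) {
    assert(key != 0);
    return (uintptr_t)p ^ key;
}

/* Recover the original pointer from its disguised form. */
void *reveal(uintptr_t disguised, uintptr_t key) {
    return (void *)(disguised ^ key);
}
```

If the disguised word is the only remaining copy of the address, a scan of
the location holding it has no reason to treat the object as reachable, even
though the program can still recover the pointer with `reveal`.
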
The following relatively simple techniques should be tried first to narrow
down the problem:

  1. If you are using the incremental collector, try turning it off for
  debugging.
  2. If you are using shared libraries, try linking statically. If that works,
  ensure that `DYNAMIC_LOADING` is defined on your platform.
  3. Try to reproduce the problem with fully debuggable unoptimized code. This
  will eliminate the last possibility, as well as making debugging easier.
  4. Try replacing any suspect typed allocation and `GC_malloc_atomic` calls
  with calls to `GC_malloc`.
  5. Try removing any `GC_free` calls (e.g. with a suitable `#define`).
  6. Rebuild the collector with `-DGC_ASSERTIONS`.
  7. If the following works on your platform (i.e. if gctest still works if
  you do this), try building the collector with
  `-DREDIRECT_MALLOC=GC_malloc_uncollectable`. This will cause the collector
  to scan memory allocated with `malloc`.

If all else fails, you will have to attack this with a debugger. The suggested
steps are:

  1. Call `GC_dump` from the debugger around the time of the failure. Verify
  that the collector's idea of the root set (i.e. static data regions which
  it should scan for pointers) looks plausible. If not, i.e. if it does not
  include some static variables, report this as a collector bug. Be sure
  to describe your platform precisely, since this sort of problem is nearly
  always very platform dependent.
  2. Especially if the failure is not deterministic, try to isolate
  it to a relatively small test case.
  3. Set a breakpoint in `GC_finish_collection`. This is a good point
  to examine what has been marked, i.e. found reachable, by the collector.
  4. If the failure is deterministic, run the process up to the last
  collection before the failure. Note that the variable `GC_gc_no` counts
  collections and can be used to set a conditional breakpoint in the right
  one. It is incremented just before the call to `GC_finish_collection`.
  If object `p` was prematurely recycled, it may be helpful to look
  at `*GC_find_header(p)` at the failure point. The `hb_last_reclaimed` field
  will identify the collection number during which its block was last swept.
  5. Verify that the offending object still has its correct contents at this
  point. Then call `GC_is_marked(p)` from the debugger to verify that the
  object has not been marked, and is about to be reclaimed. Note that
  `GC_is_marked(p)` expects the real address of an object (the address of the
  debug header if there is one), and thus it may be more appropriate to call
  `GC_is_marked(GC_base(p))` instead.
  6. Determine a path from a root, i.e. static variable, stack, or register
  variable, to the reclaimed object. Call `GC_is_marked(q)` for each object
  `q` along the path, trying to locate the first unmarked object, say `r`.
  7. If `r` is pointed to by a static root, verify that the location pointing
  to it is part of the root set printed by `GC_dump`. If it is on the stack
  in the main (or only) thread, verify that `GC_stackbottom` is set correctly
  to the base of the stack. If it is in another thread stack, check the
  collector's thread data structure (`GC_threads[]` on several platforms)
  to make sure that stack bounds are set correctly.
  8. If `r` is pointed to by heap object `s`, check that the collector's
  layout description for `s` is such that the pointer field will be scanned.
  Print `*GC_find_header(s)` to look at the descriptor for the heap chunk.
  The `hb_descr` field specifies the layout of objects in that chunk.
  See `gc_mark.h` for the meaning of the descriptor. (If its low order 2 bits
  are zero, then it is just the length of the object prefix to be scanned.
  This form is always used for objects allocated with `GC_malloc` or
  `GC_malloc_atomic`.)
  9. If the failure is not deterministic, you may still be able to apply some
  of the above techniques at the point of failure. But remember that objects
  allocated since the last collection will not have been marked, even if the
  collector is functioning properly. On some platforms, the collector can
  be configured to save call chains in objects for debugging. Enabling this
  feature will also cause it to save the call stack at the point of the last
  GC in `GC_arrays._last_stack`.
  10. When looking at GC internal data structures, remember that a number
  of `GC_xxx` variables are really macros defined as `GC_arrays._xxx`, so that
  the collector can avoid scanning them.

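The descriptor rule quoted in step 8 can be written down as a small sketch.
It covers only the case stated above (low-order 2 bits zero means a plain
length); the helper names and `DESCR_TAG_MASK` are hypothetical, and the
authoritative definitions remain those in `gc_mark.h`.

```c
#include <assert.h>
#include <stdint.h>

#define DESCR_TAG_MASK ((uintptr_t)3)  /* the low-order 2 bits */

/* Return 1 if the descriptor is a plain length, i.e. its tag bits are 0. */
int is_length_descriptor(uintptr_t descr) {
    return (descr & DESCR_TAG_MASK) == 0;
}

/* For a plain length descriptor, the length of the scanned object prefix. */
uintptr_t scanned_prefix_length(uintptr_t descr) {
    assert(is_length_descriptor(descr));
    return descr;
}
```

Under this reading, a descriptor of 0 would mean that nothing in the object
is scanned. Descriptors with nonzero tag bits use other encodings, which this
sketch deliberately does not attempt to decode.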