c: strategies for debugging obscure memory leaks?

Question

I'm working on a project in c, and I'm trying to understand how to debug an obscure bug that crashes my program. Its kinda large, attempts to isolate the problem by making smaller versions of the code are not working. So I'm trying to come up with a way to debug and pinpoint a memory leak.

I came up with the following plan: I know the problem comes from running a certain function, and that function calls itself recursively. So I thought I could make a snapshot of sorts of my program memory allocation. Since I don't know jack about what happens under the hood (I know a little not enough to be useful in this situation):

typedef struct record_mem {
    int num_allocs;
    int num_frees;
    int size_space;
    int num_structure_1;
    ...
    int num_structure_N;
    int num_records;
    struct record_mem *next;
} RECORD;
extern RECORD *top;
void pushmem(RECORD **top)
{
    RECORD *nnew = 0;
    RECORD *nnew = (RECORD *)malloc(sizeof(RECORD));
    nnew->num_allocs=1;
    nnew->num_frees=0;
    nnew->size_space=sizeof(RECORD);
    nnew->num_structure_1=0;
    ...
    nnew->num_structure_N=0;
    nnew->num_records=1;
    nnew->next=0;
    if(*top)
    {
        nnew->num_allocs+=(*top)->num_allocs;
        nnew->num_frees=(*top)->num_frees;
        nnew->size_space+=(*top)->size_space;
            nnew->num_structure_1=(*top)->num_allocs;
            ...
            nnew->num_structure_N=(*top)->num_allocs;
            nnew->num_records+=(*top)->num_records;
        nnew->next=*top;
    }
    *top=nnew;
}

the idea being I print out the contents of my memory record keeping right before the moment my program crashes (I know where it crashes thanks to GDB).

and then throughout the program (for each data structure in my program I have a similar push function like above) I can simply add a one liner with a function tallying datastructure allocations plus total stack (heap?) memory allocations (that I can keep track of). I simply make more memory_record structures wherever I feel the need to record a snapshot of my program running. Problem is this memory balance sheet recording won't helping if I can't somehow record how much memory is actually being used.

But how would I do this? Plus how would I take dangling pointers and leaks into account? I'm using OS X and I'm currently looking up how I could record the stack pointer and other things.

EDIT: Since you asked: output of valgrind: (closure() is the function called from main that returns the bad pointer: Its supposed to return the head of a doubly linked-list, traversehashmap() is a function called from closure() I use to calculate and append extra node to the linked-list and which calls itself recursively because it needs to jump around between nodes.)

jason-danckss-macbook:project Jason$ valgrind --leak-check=full --tool=memcheck ./testc
Will attempt to compute closure of AB:
Result: testcl: 0x10000d0b0
==7682== Invalid read of size 8
==7682==    at 0x100001D4E: printrelation2 (relation.h:490)
==7682==    by 0x100003CFE: main (test-computation.c:47)
==7682==  Address 0x10000cee8 is 8 bytes inside a block of size 24 free'd
==7682==    at 0xD828: free (vg_replace_malloc.c:450)
==7682==    by 0x100001232: destroyrelation2 (relation.h:161)
==7682==    by 0x100003407: destroyallhashmap (computation.h:333)
==7682==    by 0x1000039E1: closure (computation.h:539)
==7682==    by 0x100003CBE: main (test-computation.c:38)
==7682== 
==7682== 
==7682== HEAP SUMMARY:
==7682==     in use at exit: 5,360 bytes in 48 blocks
==7682==   total heap usage: 99 allocs, 51 frees, 6,640 bytes allocated
==7682== 
==7682== 48 (24 direct, 24 indirect) bytes in 1 blocks are definitely lost in loss record 33 of 37
==7682==    at 0xC283: malloc (vg_replace_malloc.c:274)
==7682==    by 0x100001104: getnewrelation (relation.h:134)
==7682==    by 0x100001848: copyrelation (relation.h:343)
==7682==    by 0x100003991: closure (computation.h:531)
==7682==    by 0x100003CBE: main (test-computation.c:38)
==7682== 
==7682== 1,128 (24 direct, 1,104 indirect) bytes in 1 blocks are definitely lost in loss record 36 of 37
==7682==    at 0xC283: malloc (vg_replace_malloc.c:274)
==7682==    by 0x100002315: getnewholder (dependency.h:129)
==7682==    by 0x100003B17: main (test-computation.c:14)
==7682== 
==7682== LEAK SUMMARY:
==7682==    definitely lost: 48 bytes in 2 blocks
==7682==    indirectly lost: 1,128 bytes in 44 blocks
==7682==      possibly lost: 0 bytes in 0 blocks
==7682==    still reachable: 4,096 bytes in 1 blocks
==7682==         suppressed: 88 bytes in 1 blocks
==7682== Reachable blocks (those to which a pointer was found) are not shown.
==7682== To see them, rerun with: --leak-check=full --show-reachable=yes
==7682== 
==7682== For counts of detected and suppressed errors, rerun with: -v
==7682== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

Does the program crash because your function has too deep recursion? Stack space is limited, and each time a function recurses it uses up more stack space. — MicroVirus, Apr 13 '14 at 20:09
I know you said not to tell you to use valgrind. But I feel the need to urge you to try it out. It is likely that the time spent learning to use it will be less then the time spent hacking at the code to find the dangling pointer.. — odedsh, Apr 13 '14 at 20:16
No, It crashes because it returns a bad pointer. I've only tested my code under one specific condition, it just cycles through a hashmap in my program and jumps back and forth between nodes theres no reason it should ever go beyond maybe 10 recursive calls. I'll update with the info if you think its necessary — jason dancks, Apr 13 '14 at 20:16
I've tried using valgrind, I can't figure out how the information it spills out can be helpful to debugging, other than telling me I'm leaking memory. It looks like it does a tally for how much is lost with each specific function call. — jason dancks, Apr 13 '14 at 20:19
On unix, you have the great chance of having `valgrind`. Getting to know how to use it *is worth* the extra time it requires, and that extra time isn't that long anyway, since valgrind will typically show you were your leak is. Also, it's unclear to me, is the problem a *memory leak* (which ends with a out of memory error) or a crash caused by a *segmentation fault* (dereferencing a dangling pointer)? Without valgrind, you're left with printf and manual code review. Also, your idea to track memory is close to what ̀`memcheck` achieves in valgrind. Don't reinvent the wheel. — dureuill, Apr 13 '14 at 20:26
The technical cause is bus error: invalid mapped address. It happens when I attempt to dereference a struct returned by my function — jason dancks, Apr 13 '14 at 20:31
@jasondancks "I've tried using valgrind, I can't figure out how the information it spills out" - *We **can***. Add it to the question. If the same blocks appear to be repeating, you can leave those out after a few repetitions. It would also help if the code posted *compiled*. (i.e. the duplicity of `nnew` is clearly not in your **real** code). — WhozCraig, Apr 13 '14 at 20:31
How is the struct returned? By value or by pointer? If by pointer, how do you allocate memory? Do you happen to return a struct by pointer by taking the address (using operator ̀ & ̀ ) of a local variable in the function that returns a struct? If so, the pointer is no longer valid after the function returns. It's a wild guess, but also a very common mistake. — dureuill, Apr 13 '14 at 20:35
The code posted above wasn't really meant to be compiled, as it was a work in progress, I just needed help to complete the train of thought to make it useful for this purpose. I use `nnew` all the time for variable names. I'll stop, I had no idea it would cause an error? — jason dancks, Apr 13 '14 at 20:50
I return by pointer, I almost never do the opposite unless its a simple integer, and I use malloc() as my understanding is using malloc to grab from the heap (?) is what can make what a pointer points to global in scope? — jason dancks, Apr 13 '14 at 20:52
Using `malloc` is the right thing to do. Have you checked it's argument (I once encountered a serious memory error because I wrongly parenthesized the arguments to malloc)? At this point, I think @WhozCraig is right that you should show us the code and valgrind output, so that we can try and find out what really happens. — dureuill, Apr 13 '14 at 20:58

score 5 · Answer 1 · edited May 23 '17 at 12:05

5

Have you tried valgrind (and its memcheck)?

$ valgrind --tool=memcheck --leak-check=full ./yourprogram

(and preferably compile your program with -g)

Edit: sorry, I didn't read that you didn't want to use Valgrind, but as pointed out in the comment from dureuill on your post, it is very useful, and learning it is worth the time.

Another information: a memory leak is caused by a missing free after some malloc or realloc (you can see here a simple example in C). You can also use grep (with -n to get the line and -r for a recursive search) to list all your memory allocation lines in your program; and try to match each of them with a call to free. However, this can be tedious, and I truly believe that using Valgrind would be faster.

edited May 23 '17 at 12:05

Community

1
1

answered Apr 13 '14 at 20:16

7heo.tk

1,074
12
23

+1 refusing to learn how to use valgrind is like saying "Yes I know there's a tool specifically designed to do this but I'm too lazy to learn it" – Kristopher Micinski Apr 13 '14 at 20:31
Plus, valgrind can show the exact line the faulty memory allocation is at; given that the program is compiled with a symbols table (`-g` on GCC). – 7heo.tk Apr 13 '14 at 20:33
ok. I get it. I'm lazy. thats fair. How exactly do I use valgrind to show the exact line the faulty memory allocation --- oh. I'm pretty sure the memory allocation at that specific moment in runtime is fine but gets corrupted by something else as time goes on. – jason dancks Apr 13 '14 at 21:24
@jasondancks : dureuill did a complete answer for that question, you just have to read it. Also, memory allocation can not get "corrupted", and is *always* fine: it's the missing `free` calls that cause a leak. 1. Find the un-freed malloc, 2. free the memory when processing is done, 3. enjoy your leak-free code. – 7heo.tk Apr 13 '14 at 22:13
I'm looking through his answer right now. I'm sorry I was so anti-valgrind. I clearly need to invest time in learning it. I was referring to the pointer being corrupted, not the block of memory its pointing to. I think I understand whats going on now. The value of the invalid memory address is 0x8, in my code whenever I free memory I always change the pointer to 0, if I try to access (the 8 bytes was referring to a char in my struct) then the passed in struct is a null pointer. – jason dancks Apr 13 '14 at 22:29

dureuill · Accepted Answer · 2014-04-13T22:01:40.840

From your valgrind output:

This is likely what causes your problem:

==7682== Invalid read of size 8
==7682==    at 0x100001D4E: printrelation2 (relation.h:490)
==7682==    by 0x100003CFE: main (test-computation.c:47)
==7682==  Address 0x10000cee8 is 8 bytes inside a block of size 24 free'd
==7682==    at 0xD828: free (vg_replace_malloc.c:450)
==7682==    by 0x100001232: destroyrelation2 (relation.h:161)
==7682==    by 0x100003407: destroyallhashmap (computation.h:333)
==7682==    by 0x1000039E1: closure (computation.h:539)
==7682==    by 0x100003CBE: main (test-computation.c:38)

Let's go in depth

==7682== Invalid read of size 8
==7682==    at 0x100001D4E: printrelation2 (relation.h:490)
==7682==    by 0x100003CFE: main (test-computation.c:47)

This is a summary for your error. You access an unallocated (or previously allocated then deallocated) memory location of 8 bytes in printrelation2, at line 490 of relation.h.

==7682==  Address 0x10000cee8 is 8 bytes inside a block of size 24 free'd

The accessed address is 8 bytes long inside a block of size 24, ie probably a field of size 8 bytes in a structure of size 24 (look for such a structure), an you previously freed this address.

==7682==    at 0xD828: free (vg_replace_malloc.c:450)
==7682==    by 0x100001232: destroyrelation2 (relation.h:161)
==7682==    by 0x100003407: destroyallhashmap (computation.h:333)
==7682==    by 0x1000039E1: closure (computation.h:539)
==7682==    by 0x100003CBE: main (test-computation.c:38)

This is the stack of the calls that resulted in freeing the address you referenced at the point where your program crashes. It starts with a free, which is normal because you probably used the free function to deallocate memory. However the file and line are the standard library, so not very relevant. What is relevant though is that this free is called from destroyrelation2 at line 161 in relation.h, and this is the faulty free. destroyrelation2 itself is called by destroyallhashmap, which is called by closure, which is called by main at line 38 of test-computation.c. You need to find out what mistake in your allocations causes that you reuse a pointer in printrelation2 that you freed previously in main at line 38.

The memory leak reported afterwards exists, but is not likely what causes your crash.

Is valgrind output clearer now?

Note 1: This memory leaks report may changes after you fix your segfault, but as it is now, here's how I interpret it:

==7682== 48 (24 direct, 24 indirect) bytes in 1 blocks are definitely lost in loss record 33 of 37
==7682==    at 0xC283: malloc (vg_replace_malloc.c:274)
==7682==    by 0x100001104: getnewrelation (relation.h:134)
==7682==    by 0x100001848: copyrelation (relation.h:343)
==7682==    by 0x100003991: closure (computation.h:531)
==7682==    by 0x100003CBE: main (test-computation.c:38)
==7682== 
==7682== 1,128 (24 direct, 1,104 indirect) bytes in 1 blocks are definitely lost in loss record 36 of 37
==7682==    at 0xC283: malloc (vg_replace_malloc.c:274)
==7682==    by 0x100002315: getnewholder (dependency.h:129)
==7682==    by 0x100003B17: main (test-computation.c:14)
==7682== 
==7682== LEAK SUMMARY:
==7682==    definitely lost: 48 bytes in 2 blocks
==7682==    indirectly lost: 1,128 bytes in 44 blocks
==7682==      possibly lost: 0 bytes in 0 blocks
==7682==    still reachable: 4,096 bytes in 1 blocks
==7682==         suppressed: 88 bytes in 1 blocks

Let's start with the summary:

==7682== LEAK SUMMARY:
==7682==    definitely lost: 48 bytes in 2 blocks
==7682==    indirectly lost: 1,128 bytes in 44 blocks
==7682==      possibly lost: 0 bytes in 0 blocks
==7682==    still reachable: 4,096 bytes in 1 blocks
==7682==         suppressed: 88 bytes in 1 blocks

You have two blocks of allocated memory which are not accessible through any pointers. That means that somewhere in the program, you malloc them and at some later point you totally forget about them. Those are bad memory leaks. You need to review your logic in order to keep a handle on these blocks, or to free them sooner in the program life. I'm unsure about indirectly lost, I'd say you don't have direct handles to your blocks, but you have pointers to structures that owns handles to the blocks. Those memory leaks can be mitigated by freeing the pointers in the structures before exit. I don't know about "possibly lost" and never had one with valgrind. "still reachable" are good memory leaks, ie at the point where valgrind crashed, you did not freed the still reachable block, but you have a pointer to it and you can easily add a call to free that pointer and solve the memory leak.

The two call stacks show you the malloc that are result in memory leaks, minus the "still reachable" leaks (to see them, you must add the options --leak-check-full --show-reachable=yes to your valgrind invocation.

Note 2: Avoid function names like destroyallhashmap (hard to read) or destroyrelation2 (numbered). Prefer destroy_all_hashmap or the less usual (in C) destroyAllHashmap and avoid numbering your functions. Similarly, avoid variable names like nnew, but use semantically sensible variable names.

Thank you very much. My code no longer crashes, My problem was I passing references as opposed to deep copies. I still have no idea how the linked list being returned got corrupted... but its not anymore that's what counts. — jason dancks, Apr 14 '14 at 02:04

fkl · Answer 3 · 2014-04-13T22:04:58.680

4

Since i saw Valgrind all over the suggestions, i would recommend a few other more generic ones that have proven useful over time.

Narrow down the code to look for bug

First, it is harder to use any tool / keep track of a large system. Try to narrow down the issue.

For e.g. Turn off modules (comment out pieces of code and see if you still keep getting the issue produced). Some hit and trials should get you to eliminate a major chunk of your code unless it is a real nasty random memory corruption.

Remove Dynamic Memory or At least Comment out memory de-allocation

Try commenting out "memory free" calls (if your situation can avoid overflowing system memory). This way you can at least eliminate or narrow down issues with deallocs. Better yet, try to run your whole system using statically allocated memory. I know it might not be very practical, but once you have a limited scope producing the crash consistently, you might be able to allocate a static large enough memory instead of needing dynamic ones. May be create an array of nodes and assign those to your pointers.

Call Stack and watch at the crash position

I assume that you already checked your call stack at the point of crash and validated the locally available pointers. That should have been the very immediate approach BEFORE trying any of above.

edited Apr 13 '14 at 22:04

answered Apr 13 '14 at 21:59

fkl

5,412
4
28
68

2

+1 for the right thing to do in an environment where valgrind is not available or impracticable. – dureuill Apr 13 '14 at 22:19
Yeah, working with C is often not on desktop environment, which effectively means limited tools, apis and basically far less of every thing we take for granted. – fkl Apr 13 '14 at 23:23
However, I'd say that dynamic memory allocation such as malloc/free should generally be avoided in embedded environments that typically don't support valgrind. I was rather thinking of Windows platform (where valgrind is not available) or of certain applications such as GUI (that often have their own memory model) or computationally expensive algorithms. In the latter context, one could still use valgrind by isolating specific functions of the algorithm, but it's not always easy to do, as leaks will typically occur when mixing functions in the application. – dureuill Apr 14 '14 at 05:45
True, often valgrind simply is not ported to the platform. Regarding dynamic memory it should be avoided as much as possible, but again you could always have business requirements which make it unavoidable. There are situations where trying to use standard tools such as valgrind, memwatch (for dynamic allocations) is simply unrealistic even if they are available on a platform. – fkl Apr 14 '14 at 15:05

c: strategies for debugging obscure memory leaks?

3 Answers3