
I'm working on a Haskell daemon that receives and processes JSON requests. While the operations of the daemon are complex, the main structure is intentionally kept simple: its internal state is just an IORef holding a data structure, and all threads perform atomic operations on this IORef. Then there are a few threads that, upon a trigger, take the value and do something with it.
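In outline, the structure looks something like the following sketch (the names `DaemonState`, `handleRequest` and `workerStep` are hypothetical stand-ins, since the real code isn't shown here):

```haskell
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef)

-- Hypothetical stand-in for the daemon's real state structure.
data DaemonState = DaemonState { requestCount :: Int }

-- Request threads update the shared state with atomic operations.
handleRequest :: IORef DaemonState -> IO ()
handleRequest ref =
  atomicModifyIORef ref (\s -> (s { requestCount = requestCount s + 1 }, ()))

-- Worker threads take the current value upon a trigger and act on it.
workerStep :: IORef DaemonState -> IO ()
workerStep ref = readIORef ref >>= print . requestCount

main :: IO ()
main = do
  ref <- newIORef (DaemonState 0)
  handleRequest ref
  workerStep ref
```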

The problem is that the daemon is leaking memory and I can't find out why. It's certainly related to the requests: when the daemon is getting several requests per second, it leaks something like 1MB/s (as reported by the Linux tools). The memory consumption steadily increases. With no requests, the memory consumption remains constant.

What puzzles me is that none of this shows up in GHC profiling. Either I'm missing something in the profiling parameters, or the memory is being consumed by something else:

Run with `+RTS -hc -xt -p`:

[screenshot of the heap profile]

Run with `+RTS -hr -xt -p`:

[screenshot of the retainer profile]

During this testing run, the daemon subsequently consumes over 1GB. So the profiling data clearly don't correspond to the actual memory consumption; they're off by orders of magnitude. (I understand that the RTS, the GC and the profiling itself add to the real memory consumption, but this difference is too big, and it doesn't explain the ever-increasing consumption.)

I already tried to rnf all the state data of the daemon inside the IORef, as well as the parsed JSON requests (to avoid parts of the JSON strings being retained somewhere), but without much success.
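For reference, the forcing I do periodically looks roughly like this (a sketch with a hypothetical `forceState` helper; it assumes the state type has an `NFData` instance):

```haskell
import Control.DeepSeq (NFData, rnf)
import Control.Exception (evaluate)
import Data.IORef (IORef, readIORef)

-- Force the current contents of the state IORef to normal form, so that
-- no thunks (e.g. ones capturing slices of request ByteStrings) are
-- retained through the state.
forceState :: NFData s => IORef s -> IO ()
forceState ref = readIORef ref >>= evaluate . rnf
```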

Any ideas or suggestions are welcome.

Update: The daemon is running without `-threaded`, so all Haskell threads run on a single OS thread.

The GC statistics are much closer to the heap profiling than to the numbers reported by Linux:

    Alloc    Copied     Live    GC    GC     TOT     TOT  Page Flts
    bytes     bytes     bytes  user  elap    user    elap
[...]
  5476616     44504   2505736  0.00  0.00   23.21  410.03    0    0  (Gen:  0)
 35499296     41624   2603032  0.00  0.00   23.26  410.25    0    0  (Gen:  0)
 51841800     46848   2701592  0.00  0.00   23.32  410.49    0    0  (Gen:  0)
 31259144     36416   2612088  0.00  0.00   23.40  410.61    0    0  (Gen:  0)
 53433632     51976   2742664  0.00  0.00   23.49  412.05    0    0  (Gen:  0)
 48142768     50928   2784744  0.00  0.00   23.54  412.49    0    0  (Gen:  0)
[...]

Update 2: I found the origin of the problem: the memory leak is caused by handleToFd (see this issue for the unix library). I just wonder how one could more effectively pinpoint such a leak (perhaps one occurring in a foreign piece of code).
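For what it's worth, a hypothetical minimal reproducer of this class of leak would look like the following; with the affected unix versions, the process RSS grows while the GHC heap profile stays flat, because the lost memory is allocated outside the GHC heap:

```haskell
import Control.Monad (forever)
import System.IO (IOMode (ReadMode), openFile)
import System.Posix.IO (closeFd, handleToFd)

-- Repeatedly convert a fresh Handle to a raw Fd. In the affected
-- versions of the unix package, each handleToFd call leaked memory
-- that the GHC heap profiler could not see.
main :: IO ()
main = forever $ do
  h  <- openFile "/dev/null" ReadMode
  fd <- handleToFd h  -- also flushes and closes the Handle
  closeFd fd
```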

asked by Petr, edited by törzsmókus
  • I'm going to guess the leaking memory is in bytestrings allocated outside the GHC heap. Heap profiles only show what GHC knows about, after all. – Carl Jul 24 '14 at 17:59
  • @Carl And is there a way to get a grasp on the `ByteString` allocations? – Petr Jul 24 '14 at 18:01
  • Nothing I know of, other than understanding what's going on in your FFI interfaces. Be suspicious of whatever libraries you're using to do networking. – Carl Jul 24 '14 at 19:40
  • Isn't `PINNED` the ByteString allocation? – jberryman Jul 24 '14 at 20:00
  • Are you using `atomicModifyIORef'` (i.e. the strict version)? If not, then try that. If so I would guess that it has to do with contention, where maybe profiling overheads are causing contention to be lessened; there seems to be some weirdness with `atomicModifyIORef`, possibly having to do with lazy blackholes, see e.g. [here](https://ghc.haskell.org/trac/ghc/ticket/8972#comment:10). I don't know if that could manifest as a space leak or not, but you might experiment with [these](http://hackage.haskell.org/package/atomic-primops-0.6.0.5/docs/Data-Atomics.html#g:2) from atomic-primops. – jberryman Jul 24 '14 at 20:14
  • What do you see reported by `+RTS -s`? Are those numbers closer to the heap profile or the system-reported memory usage? – John L Jul 25 '14 at 00:24
  • @jberryman No, because the content of my `IORef` is a structured data type, so I'd only force the evaluation of the outermost constructor. But I'm forcing the whole content to its normal form using `rnf` periodically, without much effect. – Petr Jul 25 '14 at 05:51
  • @JohnL It's much closer to the heap profile. The number of live bytes is around 3MB and the number of allocated bytes is around 50MB per GC. (I added a sample of these statistics to the question.) – Petr Jul 25 '14 at 08:11
  • Can you paste the whole first block (~5 lines) of `+RTS -s` output? There are other numbers in there that may be informative like slop and fragmentation. – Reid Barton Jul 26 '14 at 02:01

1 Answer


While I am not familiar with the Haskell daemon itself, to answer your question of how one could more effectively pinpoint such a leak: it might be possible to use

`valgrind --leak-check=yes haskelldaemon` (better if you compile it with debug info),

or, if the leak happens in a shared library, try

`LD_PRELOAD="yourlibrary.so" valgrind your-executable`.
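Note that this is complementary to the GHC heap profiler: Valgrind's memcheck tracks allocations made through `malloc`/`free`, including leaks in C code reached via the FFI, which is precisely the memory that the GHC heap profile cannot see.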

answered by novorado