
I'm using MIC for the LRU cache in my server; it has replaced a list/map-based LRU since I suspected that this was the cause of some unexplained memory footprint. Memory leaks are out of the picture; at least no tool has found any leak, and neither has code inspection. Since I started to use MIC the picture has improved (that's the only proof of memory fragmentation), but not enough. We are talking about several GB of cache, with millions of records being ejected from it on a daily basis. After two to three weeks the problem becomes clear to see: if I empty the cache, the process still holds an unexplained 2-3 GB of memory.
My container is quite simple:

// assuming the usual Boost.MultiIndex headers and the alias
// namespace mic = boost::multi_index;
typedef std::pair<Key, T> Element;
typedef mic::multi_index_container<
    Element,
    mic::indexed_by<
        mic::sequenced<mic::tag<struct Seq>>,
        mic::hashed_unique<mic::tag<struct Hash>,
                           mic::member<Element, const Key, &Element::first>>>>
    item_list;

It uses `erase` and `push_front` to insert a new entry (or overwrite an old one), and then, if needed, ejects an element from the tail. The question is: is it worth trying to use `replace` and `relocate` instead of `erase`/`push_front`?
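For concreteness, here is a minimal sketch of both strategies (the `put_*` names and `MAX_SIZE` are illustrative, not from the original code; `replace`, `relocate` and `project` are used as Boost.MultiIndex documents them):

const std::size_t MAX_SIZE = 1000000; // illustrative capacity

// erase + push_front: the old node is freed and a new one is allocated
void put_erase(item_list& il, const Key& k, const T& v)
{
    auto& hash = il.get<Hash>();
    auto it = hash.find(k);
    if (it != hash.end())
        hash.erase(it);
    il.get<Seq>().push_front(Element(k, v));
    if (il.size() > MAX_SIZE)
        il.get<Seq>().pop_back(); // eject from the tail
}

// replace + relocate: the existing node is reused in place
void put_relocate(item_list& il, const Key& k, const T& v)
{
    auto& seq  = il.get<Seq>();
    auto& hash = il.get<Hash>();
    auto it = hash.find(k);
    if (it != hash.end()) {
        hash.replace(it, Element(k, v));                // overwrite the value, no reallocation
        seq.relocate(seq.begin(), il.project<Seq>(it)); // splice the node to the front
    } else {
        seq.push_front(Element(k, v));
        if (il.size() > MAX_SIZE)
            seq.pop_back();
    }
}

The point of the second variant is that `replace` overwrites the stored pair in place and `relocate` merely splices the node to the front of the sequenced index, so overwriting an existing entry neither deallocates nor allocates a node.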

UPDATE001: OK, the new version is up and running. I see that relocate significantly improved the situation: the memory footprint after 3 weeks was ~1-1.5 GB less than the footprint on machines without the change. Now it is deployed on all machines worldwide. As a second stage there are numerous changes to the cache invalidation machinery; fewer ejections and re-insertions should improve the situation too (in case it really is memory fragmentation).

kreuzerkrieg
  • just asking - are you aware of how malloc/free work, and why it does not matter that process memory is never released once allocated? – Richard Hodges Nov 02 '15 at 19:30
  • it doesn't matter how malloc/free work, since they are not in charge of allocation/release of memory; the OS memory manager is. In this particular case VirtualAllocEx will kick in, deciding how much memory will be allocated (yep, even this is not done by malloc), and indeed free does not guarantee that the memory will actually be released back to the OS. However, you can force the OS to reclaim memory that was freed but not released back to it – kreuzerkrieg Nov 02 '15 at 19:44
  • right... so why are you surprised that the memory footprint of the process does not go down? Or am I misunderstanding the problem? – Richard Hodges Nov 02 '15 at 19:46
  • it goes down, let's say from 7 GB to 3 GB, when 1.2 GB is the initial allocation. Then I force the OS to try and release all unallocated pages held by the process, and it doesn't work (it works to some extent), so I guess it is fragmentation. I don't have proof, but changing one implementation of LRU to another improved the memory state of the process, and MIC is known for fragmenting memory less, so I guess it strengthens my theory – kreuzerkrieg Nov 02 '15 at 19:54
  • I see. So maybe a custom allocator for the cached objects (that guarantees that the cached objects are always stored in pages reserved to them) will help you? – Richard Hodges Nov 02 '15 at 20:26
  • Exactly how do you establish the "fact" that 2-3GiB of RAM are "being held"? – sehe Nov 02 '15 at 21:28
  • @RichardHodges, maybe; the question is what it takes to implement such an allocator (a sketch follows after this thread) – kreuzerkrieg Nov 03 '15 at 04:40
  • @sehe, not sure I got you. Commit size? – kreuzerkrieg Nov 03 '15 at 04:43
  • I don't think `replace`/`relocate` will show better memory behavior than `erase`/`push_front` (any sane allocator will reuse freshly destroyed nodes), but seems like trying it is easy enough, right? Why don't you make the test? – Joaquín M López Muñoz Nov 03 '15 at 07:15
  • @JoaquínMLópezMuñoz that's it, it will take 3 weeks or so to validate. I was hoping someone knew the answer. OK, will try to upload a test version to production this coming Sunday – kreuzerkrieg Nov 03 '15 at 07:24
  • can't you test the idea with random allocations/deallocations played at an artificially high rate? You don't need all the comms/logic that is in your server application – Richard Hodges Nov 03 '15 at 07:56
  • @RichardHodges, possible, but a good set of data which is really random would have to be generated, otherwise it will not emulate a real-life scenario. Such a set, with different data and variable load, exists in (surprise!) production :) but of course running a stand-alone test is the cleanest way to test it. Will do if the production test yields no result – kreuzerkrieg Nov 03 '15 at 08:15
  • When I used to write FX trading code we used to replay logs through the test harness. That allows you to simulate real-life traffic in accelerated time. Best of luck. – Richard Hodges Nov 03 '15 at 10:41
  • This is what we do too, but it is complicated; we have to replicate the production DB to the dev env, etc. Thanks! – kreuzerkrieg Nov 03 '15 at 15:32
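As a starting point for the custom-allocator idea discussed above: multi_index_container takes an allocator as its third template parameter, so a pool allocator can be plugged in without touching the rest of the code. A minimal sketch, assuming Boost.Pool is acceptable (boost::fast_pool_allocator is one option; whether it actually reduces fragmentation for this workload is untested):

#include <boost/pool/pool_alloc.hpp>

// Same container as in the question, but nodes are carved out of pool
// chunks; the container rebinds the allocator to its internal node type.
typedef mic::multi_index_container<
    Element,
    mic::indexed_by<
        mic::sequenced<mic::tag<struct Seq>>,
        mic::hashed_unique<mic::tag<struct Hash>,
                           mic::member<Element, const Key, &Element::first>>>,
    boost::fast_pool_allocator<Element>>
    pooled_item_list;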

1 Answer


We have experienced the same thing. I wrote a little test program that uses our cache from 300 threads. It keeps inserting and erasing at approximately 200k (insert+erase)/sec, and I ran it over the weekend. I examined the memory usage via the total RSS reported by pmap -x:

pmap -x [pid] | tail -n1 | awk '{print $4}'

At one-minute resolution one can see that until the cache is loaded the memory consumption grows ~linearly from 0 to 4.7 GB, and after that it keeps increasing at a ~logarithmic rate, as the figure shows. (It took 14 minutes to fill the cache.) [figure: total RSS over time]

One more interesting thing is that pmap -x reported a lot of 65536k chunks of virtual memory getting loaded (so there might be a theoretical maximum for this excessive memory usage), but if I ran the same thing from one thread, the test program allocated a single 4.7 GB chunk, and after the cache got full the memory usage was constant.
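For reference, a driver along the lines of the test described above might look like this (a sketch only: the thread count matches the description, but the locking, the MAX_SIZE value, and the Key/T construction are assumptions, not the actual harness):

#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

std::mutex cache_mutex;
item_list cache; // the container from the question

const std::size_t MAX_SIZE = 10000000; // illustrative capacity

void worker(std::size_t id, std::size_t iterations)
{
    for (std::size_t i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(cache_mutex);
        // assumes Key is constructible from an integer and T is default-constructible
        cache.get<Seq>().push_front(Element(Key(id * iterations + i), T()));
        if (cache.size() > MAX_SIZE)
            cache.get<Seq>().pop_back(); // steady-state insert+erase churn
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (std::size_t id = 0; id < 300; ++id)
        threads.emplace_back(worker, id, 100000000);
    for (auto& t : threads)
        t.join();
}

The 65536k mappings themselves are consistent with glibc malloc's per-thread arenas, which reserve heaps in 64 MB chunks on 64-bit Linux and are only created when multiple threads allocate concurrently; that would also explain why the single-threaded run shows one flat 4.7 GB allocation instead.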

TmsKtel
  • BTW, we abandoned the LRU stuff altogether and switched to an unbounded unordered_map. Well, we didn't get too far from the memory management problems. Look here: http://stackoverflow.com/questions/39313820/stdunordered-map-does-not-release-memory – kreuzerkrieg Dec 19 '16 at 11:50