
MongoDB 2.4.6 & 2.4.8

Use case:

  1. Load up 100,000 documents into a collection with 2 indexes. Resident memory increases (per mongostat) and no page faults occur.
  2. Restart mongod. Resident memory is low (this is expected).
  3. Try to 'preheat' mongo with the touch command `db.runCommand({ touch: "collection", data: true, index: true })` or by other means (at the OS level, vmtouch / dd).
    a) At this step, on my development machine (MacOS), mongostat shows a lot of page faults while heating up (expected) and resident memory rises. From that point on, updates do not cause any page faults.
    b) On a NUMA server (256 GB RAM), even though I started mongod following this guide: http://docs.mongodb.org/manual/administration/production-notes/#mongodb-on-numa-hardware (note: I do not have superuser access; however, the 2nd step, echoing 0 into /proc/sys/vm/zone_reclaim_mode, was already 0 so I left it as is), I cannot seem to pre-heat the memory with the 'touch' command. Nothing happens, even though it returns successfully. In mongostat, only 'mapped' and 'vsize' get higher; resident memory stays the same (35m). I even tried to load the data files into the OS cache with the vmtouch and dd commands. Only re-indexing the collection changed the resident memory.
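For reference, the preheat attempts above look roughly like this (a sketch only; the database, collection, and dbpath names are placeholders, not the ones from this deployment):

```shell
# 1) Ask mongod to page the collection's data and indexes into memory:
mongo mydb --eval 'printjson(db.runCommand({ touch: "mycollection", data: true, index: true }))'

# 2) Alternatively, warm the OS file-system cache directly:
vmtouch -t /data/db/mydb.*                  # touch every page of the data files
dd if=/data/db/mydb.0 of=/dev/null bs=8M    # or read a data file sequentially

# 3) In a second terminal, watch resident memory and page faults while warming up:
mongostat --host localhost 1
```

These commands operate against a running mongod, so the expected effect (res rising, faults spiking then settling) can only be observed live in mongostat.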

The problem started a while after I began loading data into the server. I do a lot of upserts, and performance was excellent at first (3,000-4,000 upserts/sec). This was expected, because the working set fit in memory. After 30,000,000 documents the process makes a lot of page faults and I do not know why. The data files are approx. 33 GB and performance is about 500 upserts/sec, with a lot of page faults. That should mean the working set is not in memory; however, 256 GB of RAM should be more than enough. I tried the 'touch' command, but resident memory stayed low (I even restarted the mongod process and ran the touch command again; even though 'mapped' and 'vsize' skyrocketed to many GB, resident memory stayed low, at 35m). I tried to reIndex the collection and voilà, resident memory went from 35m to 20 GB. However, I again saw page faults. Then I tried to vmtouch the data files (or read them with dd). Again, a lot of page faults.
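The mongostat figures described above can also be pulled directly from the shell; a minimal sketch (the database name is a placeholder, counter names per the 2.4-era serverStatus output):

```shell
# Print the same memory and fault counters mongostat derives,
# straight from db.serverStatus():
mongo mydb --eval 'var s = db.serverStatus();
print("resident MB: " + s.mem.resident);
print("mapped MB:   " + s.mem.mapped);
print("page faults: " + s.extra_info.page_faults);'
```

Sampling page_faults twice a few seconds apart gives the fault rate, which is the number to watch while upserting.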

The problem is that I cannot live with 'only' 500 upserts/sec. Should I change my application logic? I thought that with 256 GB of memory my 'active' working set (expected 60 GB) would fit in memory. I am only halfway there (30 GB) and it seems I cannot do anything to fix this. Is it the NUMA hardware? Should I make any other changes?

Thanks in advance

ialex
  • I saw this before, where touch does not actually send pages to resident memory but to the cache, if I remember right; it was in the Google Groups somewhere if you search. It is expected behaviour and should be OK – Sammaye Nov 15 '13 at 08:05
  • 1
    In addition I believe it won't show as in resident until it is actually read by mongod, a kind of clue I got from here: https://groups.google.com/forum/#!topic/mongodb-user/UfQoyllDNGU , I know it only relates to serverStatus but I think same applies here – Sammaye Nov 15 '13 at 08:09
  • 1
    However, on my development machine (MacOS), the resident memory increased when I did a 'touch' on the data&index. On the server did not. Moreover, in the google groups question on your comment, the user there states that after the 'touch' command, the resident memory was 5821MB. On my case, the resident memory is around 35-90MB. With 33GB data, it should be more, I expect – ialex Nov 15 '13 at 08:39
  • Oh yes you should, have you checked your readahead settings? – Sammaye Nov 15 '13 at 08:46
  • 1
    Unfortunately, since I have no special access on the hardware (superuser), I cannot find it out. Unless there is another method. I have NOT seen any warning like that however: ** WARNING: Readahead for .. is set to 512KB **, so I think the readahead is ok, otherwise I would get a warning in the log – ialex Nov 15 '13 at 09:16
  • This problem is kinda weird now that I actually read every last word. This probably won't show anything related to the memory at all but could you provide a db.stats() for the collection you are trying to touch()? It will give us an idea of the document size and fragmentation too. This is definitely an odd one that it suddenly stops being performant at 30m – Sammaye Nov 15 '13 at 09:41
  • 1
    Collection stats are: `{ "ns" : "cortexDay.accumulatedData", "count" : 36425904, "size" : 24786208048, "avgObjSize" : 680.4555364775573, "storageSize" : 27372449648, "numExtents" : 33, "nindexes" : 2, "lastExtentSize" : 2146426864, "paddingFactor" : 1.0000000027731564, "systemFlags" : 0, "userFlags" : 0, "totalIndexSize" : 2320741248, "indexSizes" : { "_id_" : 1063713952, "d_1_m_1" : 1257027296 }, "ok" : 1 } ` (trying to format it, sorry) – ialex Nov 15 '13 at 09:51
  • nope your stats are perfect not even a single byte of fragmentation, this must be something with numa specifically – Sammaye Nov 15 '13 at 09:55
  • 1
    I thought so, but I cannot figure out why and how I may fix this. Only other option (if I cannot solve this), would be a virtual machine? I cannot think of anything else. – ialex Nov 15 '13 at 10:01
  • 1
    I wouldn't think so, I run on AWS which is basically all virtualised servers, I have been reading a little: http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ it seems common for numa to page out (swap out in that case) memory pretty much as soon as it comes in which is weird and hugely damaging for any database esepcially one which is memory mapped. – Sammaye Nov 15 '13 at 10:03
  • 1
    I was expecting that since I followed the instructions for running mongodb with interleave policy = all (thus, disabling numa on the process, `numactl --interleave=all /usr/bin/local/mongod..`), memory related problems (like this) would not happen. If they still happen, how can someone override it? `numactl --interleave=all ` is not overriding this? – ialex Nov 15 '13 at 10:07
  • Indeed, he mentions the interleave policy after I posted my comment. Hmm, my knowledge of NUMA is peaking (I don't use it), but I'll keep reading to see if it could still be having an effect and how to solve it – Sammaye Nov 15 '13 at 10:09
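Both suspects raised in the comments, readahead and the NUMA interleave policy, can be sanity-checked without superuser access. A sketch, assuming the dbpath lives on /dev/sda (the device name and paths are assumptions, not from the thread):

```shell
# Readahead for the device holding the dbpath:
cat /sys/block/sda/queue/read_ahead_kb      # readable by any user, value in KB
blockdev --getra /dev/sda                   # value in 512-byte sectors; may need root

# Verify the interleave policy actually applied to the running mongod:
numactl --show                              # policy of the current shell
cat /proc/$(pgrep mongod)/numa_maps | head  # mongod's mappings should say "interleave"
```

If numa_maps shows "default" rather than "interleave" on the heap mappings, the numactl wrapper did not take effect for that process.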

1 Answer


I just wrote a pretty detailed answer over on ServerFault regarding resident memory, page faulting, and how to troubleshoot, tweak, and tune, so I will not rehash that here.

I will say that Sammaye's comment is correct: the touch (or dd, vmtouch, etc.) command will not cause memory to be reported as resident against the mongod process until the process actually accesses the data (until then it is just in the FS cache). You can then hit the issue in SERVER-9415, which can cause resident memory to be under-reported.
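The distinction described here is observable directly: after a touch, the pages show up in the file-system cache but not yet in mongod's resident set. A sketch (paths are placeholders):

```shell
# What fraction of the data files is in the FS cache (no -t: report only,
# do not touch):
vmtouch /data/db/mydb.*

# Resident set of the mongod process itself, in kB, from the kernel:
grep VmRSS /proc/$(pgrep mongod)/status
```

After warming, vmtouch can report the files near 100% cached while VmRSS stays small; only once mongod reads those pages do they get charged to its resident set.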

I think you are already looking at the key metrics here, and you should be able to achieve higher resident memory than you are reporting (or at least get more data into memory without significant page faults). The situation you describe sounds like memory pressure from elsewhere, but I assume you would have noticed another process eating significant amounts of memory.

What I will note is that I have previously spent days (literally) attempting to make a particular AWS instance go above a 30% memory threshold without success.

When we finally gave up and tried on another instance, without changing a thing (we just added a new instance as a secondary and failed over to it) it instantly went to over 70% resident memory. Granted, that was on m2.4xlarge instances, so not at the same scale as yours, but it's always worth bearing in mind. If you can try it on another instance, I would recommend giving it a shot.

Adam Comerford