how to memory map a huge matrix?

Question

Suppose you have a huge (40+ GB) matrix of floating-point feature values, where the rows are different features and the columns are the samples/images.

The table is precomputed column-wise. It is then read in full, row-wise and by multiple threads (each thread loads a whole row), several times.

What would be the best way to handle this matrix? I'm especially pondering over 5 points:

1. Since it's run on an x64 PC I could memory map the whole matrix at once, but would that make sense?
2. What about the effects of multithreading (and of a multithreaded initial computation as well)?
3. How to lay out the matrix: row or column major?
4. Would it help to mark the matrix as read-only after the precomputation has been finished?
5. Could something like http://www.kernel.org/doc/man-pages/online/pages/man2/madvise.2.html be used to speed it up?

This question might be closed for being *too interesting* for SO -- but I hope not. Is there a constraint on the operating system? (Guessing Linux from the link.) — , Jan 29 '11 at 20:41
I don't get why it could be closed, is there some rule I missed? Yep, the software is currently restricted to Linux. But answers regarding Windows are welcome as well. — Trass3r, Jan 29 '11 at 23:55

Phil Miller · Accepted Answer · 2011-02-03T02:36:05.777

Memory mapping the whole file could make the process much easier.

You want to lay out your data to optimize for the most common access pattern. It sounds like the data is going to be written once (column-wise) and read several times (row-wise). That suggests the data should be stored in row-major order.
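To make the layout point concrete, here is a minimal sketch using Python's standard `mmap` module. The tiny 4 x 6 shape, the 4-byte float element type, and the helper names `row_major_offset` and `read_row` are assumptions for illustration only. Column-wise writes land at strided offsets, while each row read is one contiguous slice.

```python
import mmap
import struct
import tempfile

ROWS, COLS = 4, 6              # tiny stand-in for the real features x samples matrix
ITEM = struct.calcsize("f")    # assumed element type: 4-byte float


def row_major_offset(row, col, cols=COLS, item=ITEM):
    """Byte offset of element (row, col) in a row-major file."""
    return (row * cols + col) * item


def read_row(mm, row, cols=COLS, item=ITEM):
    """Read one full row; in row-major order it is a single contiguous slice."""
    start = row_major_offset(row, 0, cols, item)
    return struct.unpack(f"{cols}f", mm[start:start + cols * item])


def demo():
    with tempfile.TemporaryFile() as f:
        f.truncate(ROWS * COLS * ITEM)          # size the backing file first
        with mmap.mmap(f.fileno(), 0) as mm:
            # The precomputation arrives column by column: strided writes.
            for col in range(COLS):
                for row in range(ROWS):
                    off = row_major_offset(row, col)
                    mm[off:off + ITEM] = struct.pack("f", row * 10 + col)
            # The later passes read whole rows: contiguous reads.
            return read_row(mm, 2)
```

The strided cost is paid once during the write phase, and every later row pass reads sequentially.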

Marking the matrix read-only once the pre-computation is done probably won't help performance (there are some possible low-level optimizations, but I don't think anything implements them), but it will prevent bugs from accidentally writing to data you don't intend to. Might as well.
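As a small illustration of the bug-catching benefit, the sketch below reopens a file with a read-only mapping through Python's `mmap` module. The function name `write_is_rejected` is made up for this example. In Python a write to an `ACCESS_READ` map raises `TypeError`; in C the rough equivalent would be mapping with `PROT_READ` (or calling `mprotect`), where a stray write faults instead.

```python
import mmap
import tempfile


def write_is_rejected():
    """Map a file read-only and confirm that an accidental write fails."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\x00" * mmap.PAGESIZE)   # stand-in for the finished matrix
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            try:
                mm[0:4] = b"oops"          # the kind of bug we want to catch
            except TypeError:
                return True                # read-only maps refuse assignment
    return False
```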

madvise could end up being useful, once you've got your application written and working.
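If you do get to that stage, a hint could look roughly like the following sketch. It assumes Python 3.8+ on a platform that exposes `mmap.madvise` and `MADV_WILLNEED`; `advise_row` and `row_bytes` are hypothetical names, and whether the hint actually helps is something only measurement will tell.

```python
import mmap
import tempfile


def advise_row(mm, row, row_bytes):
    """Hint to the kernel that one row is about to be read."""
    if not (hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED")):
        return False                       # platform without madvise support
    start = row * row_bytes
    # madvise expects a page-aligned start, so round down to a page boundary.
    aligned = start - (start % mmap.PAGESIZE)
    length = start + row_bytes - aligned
    mm.madvise(mmap.MADV_WILLNEED, aligned, length)
    return True


def demo():
    row_bytes = 3000                       # deliberately not a page multiple
    rows = 8
    with tempfile.TemporaryFile() as f:
        # Fill row r with the byte value r so reads are easy to check.
        f.write(bytes([r for r in range(rows) for _ in range(row_bytes)]))
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            advised = advise_row(mm, 5, row_bytes)
            start = 5 * row_bytes
            return advised, mm[start:start + row_bytes]
```

The advice is only a hint: the data read back is the same whether or not the kernel acts on it.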

My overall advice: write the program in the simplest way you can, sequentially at first, and then put timers around the whole thing and the various major operations. Make sure the major operation times sum to the overall time, so you can be sure you're not missing anything. Then target your performance improvement efforts toward the components that are actually taking the most time.
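A minimal sketch of that timing discipline, using only the standard library. The phase names `precompute` and `row_pass` and the toy workload are placeholders, not the real computation.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(name):
    """Accumulate wall-clock time spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


def run():
    timings.clear()
    with timed("total"):
        with timed("precompute"):
            data = [i * 0.5 for i in range(100_000)]   # placeholder work
        with timed("row_pass"):
            result = sum(data)                          # placeholder work
    accounted = timings["precompute"] + timings["row_pass"]
    return result, timings["total"], accounted
```

If `accounted` is much smaller than the total, some significant phase is not being timed yet.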

Per JimR's mention of 4MB pages in his comment, you may end up wanting to look into hugetlbfs or using a Linux Kernel release with transparent huge page support (merged for 2.6.38, could probably be patched into earlier versions). This would likely save you a whole lot of TLB misses, and convince the kernel to do the disk IO in sufficiently large chunks to amortize any seek overhead.

If you don't access the memory correctly you could end up in a thrash fest. Make sure you measure page faults in/out if you find this slow. zvrba covers some of the problems you'll see in his answer, particularly #3. I worked on something similar in the early 90s (200ish to 1G) and the thrashing from faulting things in and out ruined it completely. This was at a time when 64MB of RAM was considered maxed out. You can reduce the thrashing (by reducing overhead) if you can change the page size from 4096 bytes to, I think, 4MB. — JimR, Jan 29 '11 at 23:12
At > 40 GB, I think we can assume it's too big for main memory. So a naive implementation (as is being suggested here) will indeed lead to a "thrash fest". — Tim Cooper, Apr 08 '11 at 09:34
I may be spoiled, but I do have access to machines with more RAM than that. Regardless, unless the computation phase is really heavy, just reading the data sequentially will take as much time as the rest of the program. The sensible 'naive' implementation would read the data sequentially, and so get essentially full performance at that limit. — Phil Miller, Apr 08 '11 at 21:56

score 3 · Answer 2 · answered Jan 29 '11 at 21:27

1. Maybe, see below.
2. The size of the total working set of all threads must not exceed available RAM, otherwise the program will run at snail speed because of swapping.
3. Layout should match access patterns, as long as condition 2 is respected.
4. What do you mean by "mark as read only"?
5. Measure it.

Re 3: If you have, e.g., 8 CPUs but do not have enough RAM to load 8 rows, you should make each thread process its row sequentially in manageable chunks. In this case, a block layout of the matrix would make sense. If the thread MUST have the whole row in memory to process it, I'm afraid that you can't use all the CPUs, as the process will start thrashing, i.e., evicting some subset of the matrix from RAM and reloading another needed subset. This is slightly less bad than full swapping because the matrix is never modified, so the contents of the pages do not need to be written to the swap file before being evicted. But it still hurts performance badly.
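The chunked approach could be sketched like this with Python's standard `mmap` and `struct` modules. The function name `row_sum_chunked`, the chunk size, and the summation standing in for the real per-row work are all assumptions; the point is only that each thread touches a bounded piece of its row at a time.

```python
import mmap
import struct
import tempfile


def row_sum_chunked(mm, row, cols, chunk_cols=1024, item=4):
    """Sum one row while decoding at most chunk_cols values at a time."""
    base = row * cols * item               # row-major: rows are contiguous
    total = 0.0
    for first in range(0, cols, chunk_cols):
        n = min(chunk_cols, cols - first)
        start = base + first * item
        values = struct.unpack(f"{n}f", mm[start:start + n * item])
        total += sum(values)               # stand-in for the real processing
    return total


def demo(rows=3, cols=10):
    with tempfile.TemporaryFile() as f:
        for row in range(rows):
            f.write(struct.pack(f"{cols}f", *(row * 100 + c for c in range(cols))))
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return row_sum_chunked(mm, 1, cols, chunk_cols=4)
```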

Also, doing random-access I/O from multiple threads is a bad idea, and that is what you'll end up doing if you use mmap(). You have (presumably) only a single disk, and parallel I/O will just make it slower. So mmap() might not make sense, and you could achieve better I/O performance by reading data sequentially into RAM.

Note that 40GB is approximately 10.5 million pages of 4096 bytes. By doing mmap(), you will, in the worst case, slow down computation by that many hard disk seeks. At 8ms per seek (taken from Wikipedia), you'll end up wasting about 83,886 seconds, i.e., almost a whole day!
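The worst-case estimate is simple arithmetic and can be reproduced as below, assuming 40 GiB, 4096-byte pages, one seek per page, and the quoted 8 ms per seek. The constant names are just for this sketch.

```python
GIB = 1024 ** 3
PAGE_BYTES = 4096
SEEK_SECONDS = 0.008                       # 8 ms per seek, as quoted above

pages = 40 * GIB // PAGE_BYTES             # number of 4096-byte pages in 40 GiB
worst_case_seconds = pages * SEEK_SECONDS  # one seek per page, worst case
worst_case_hours = worst_case_seconds / 3600

print(pages, round(worst_case_seconds), round(worst_case_hours, 1))
```

This is an upper bound: sequential access and kernel read-ahead fetch many pages per seek.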

Well, a single row is in the order of a few MB plus I got 12GB RAM, so that's not the problem. — Trass3r, Jan 30 '11 at 00:21
Ok. But mmapping will still potentially generate a lot of random I/O. — zvrba, Jan 30 '11 at 08:16

Tim Cooper · Answer 3 · 2011-04-08T10:02:11.703

If you could fit the whole thing into main memory, then yes: memory map it all, and it doesn't matter whether it's column major or row major. However, at 40+ GB, I'm sure it's too big for main memory. In which case:

1. No, don't map the whole thing! At least, don't expect the memory to work like normal memory if you map it all. Your program will take forever if you don't properly deal with the I/O issues.
2. The multi-threaded access issue is solved if you store it row-major (it sounds like you don't have multi-threaded column writes).
3. You should lay it out row-wise, assuming each cell is written once and then read many times.
4. Yes, I think it would help to mark the matrix as read-only after it's been written, but purely as a way to prevent bugs (accidental writes). It won't affect performance.
5. No, no amount of clever kernel read-ahead is going to solve your performance problems. You need to solve it at the algorithm level.

I think you are going to have a performance problem with a naive implementation. Either the computer will thrash while writing (if you store it row major) or it will thrash while querying (if you store it column major). The latter is presumably worse, but it's a problem both ways.

The right solution is to use an intermediate representation which is neither row-major nor column-major but 'large squares'. Take the first 50,000 columns and store them in a memory-mapped file (phase 1). It doesn't matter if it's column major or row major since it'll be purely memory resident. Then, take each row and write it into the final row-major memory-mapped file (phase 2). Then repeat the cycle for the next 50,000 columns, and so on.
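A rough sketch of the two-phase idea, with several simplifications that are assumptions of this example rather than part of the scheme above: the column block is held in an ordinary in-memory list instead of a memory-mapped scratch file, `compute_column` is a hypothetical stand-in for the real feature computation, and the sizes are tiny.

```python
import mmap
import struct
import tempfile


def compute_column(col, rows):
    """Hypothetical stand-in for the real per-sample feature computation."""
    return [float(row * 1000 + col) for row in range(rows)]


def build_row_major(mm, rows, cols, block_cols, item=4):
    """Fill a row-major mapping from column-wise results, one block at a time."""
    for first in range(0, cols, block_cols):
        n = min(block_cols, cols - first)
        # Phase 1: compute a block of n columns and keep it in RAM.
        block = [compute_column(first + j, rows) for j in range(n)]
        # Phase 2: per row, write that row's n values as one contiguous run.
        for row in range(rows):
            start = (row * cols + first) * item
            run = struct.pack(f"{n}f", *(block[j][row] for j in range(n)))
            mm[start:start + n * item] = run


def demo(rows=3, cols=7, block_cols=3):
    with tempfile.TemporaryFile() as f:
        f.truncate(rows * cols * 4)
        with mmap.mmap(f.fileno(), 0) as mm:
            build_row_major(mm, rows, cols, block_cols)
            # Read back row 1 to check the final layout.
            return struct.unpack(f"{cols}f", mm[cols * 4:2 * cols * 4])
```

With a block width of `n` columns, each row receives one write of `n` values per block instead of `n` scattered single-value writes, so wider blocks mean fewer, larger writes.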
