I have a bunch of big files; each can be over 100 GB, and the total amount of data is around 1 TB. They are all read-only files (I only do random reads in them).
My program does small reads in these files on a computer with about 8 GB of main memory.
In order to increase performance (no seek() calls and no buffer copying), I thought about using memory mapping, and basically memory-mapping the whole 1 TB of data.
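To make it concrete, this is roughly what I have in mind (a minimal C sketch for a single file; the path and the offset are just placeholders, and real code would map each file of the set the same way):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/data/bigfile.bin";      /* placeholder path */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; on a 64-bit machine the virtual
       address space is large enough even for a terabyte of mappings. */
    const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  /* the mapping stays valid after the descriptor is closed */

    /* A small random read is now just a pointer dereference;
       the kernel pages the data in on demand. */
    off_t offset = 123456789;                    /* placeholder offset */
    printf("byte at %lld = %d\n", (long long)offset, base[offset]);

    munmap((void *)base, st.st_size);
    return 0;
}
```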
Although it sounds crazy at first, since main memory is much smaller than the data on disk, with some insight into how virtual memory works you can see that on 64-bit machines this should not be a problem.
All the pages read from disk to satisfy my reads will be considered "clean" by the OS, because they are never modified. This means they can go straight back onto the list of reclaimable pages without being written back to disk or swapped out. So the operating system could keep just the most recently used (LRU) pages in physical memory and would only issue a read when a page is not resident.
This would mean no swapping and no extra I/O caused by the huge memory mapping.
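If it matters, I was also planning to tell the kernel that the access pattern is random, so it does not waste memory on read-ahead. Something like this hypothetical helper around madvise(), where base and len are the mapping from the sketch above:

```c
#include <stdio.h>
#include <sys/mman.h>

/* Hint that accesses to the mapping are random, so the kernel skips
   aggressive read-ahead and can reclaim the clean pages freely.
   It is only an optimization hint; failure is not fatal. */
static void hint_random_access(void *base, size_t len)
{
    if (madvise(base, len, MADV_RANDOM) != 0)
        perror("madvise");
}
```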
That is the theory; what I'm looking for is anyone who has actually tried or used such an approach in production and can share their experience: are there any practical issues with this strategy?