
I have the following problem to solve: I have to build a graph viewer for a massive data set.

We have files in a particular format that contain millions of records representing the results of an experiment. Each record represents a sample point on a large graph plot. The biggest file I have seen has 43.7 million records.

An average file contains 10 million records. Each record is small (76 bytes, plus an optional 12 bytes). The complete data set cannot be loaded into main memory as it is too large. I have built a new file format that compresses the data to 48 bytes per record and organises it into chunks of associated records. I want to "view" the data by displaying the records in a 2D/3D plot. As the data is very dense, I would like to progressively increase the level of detail by loading more data, and to remove data that is not shown in the current view from main memory.
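
For concreteness, the kind of on-disk layout I have in mind looks roughly like this; the field names and the split are placeholders, only the 48-byte record size matches the real format:

    // Hypothetical layout for one 48-byte compressed record and a chunk header.
    // Field names and sizes are assumptions for illustration only.
    #include <cstdint>

    #pragma pack(push, 1)
    struct Record48 {
        uint64_t id;          // record identifier / hash
        float    x, y, z;     // sample position for the 2D/3D plot
        float    value;       // measured quantity
        uint32_t groupId;     // links associated records together
        uint8_t  payload[20]; // remaining compressed fields
    };                        // 8 + 16 + 4 + 20 = 48 bytes

    struct ChunkHeader {
        uint64_t firstRecordId;  // id range covered by this chunk
        uint64_t lastRecordId;
        uint32_t recordCount;    // number of Record48 entries that follow
        uint32_t lodLevel;       // coarse-to-fine level this chunk belongs to
    };
    #pragma pack(pop)

    static_assert(sizeof(Record48) == 48, "record must stay 48 bytes");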

I would also like to access groups of associated records in real time and pre-load similar records to keep loading times to a bare minimum. This should give the user smooth control when viewing the data, instead of an experience like watching a YouTube video over a very slow internet connection. The user cannot jump around randomly and has to use the controls to navigate, and I would like to use this information to load the relevant records into main memory ahead of time.
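
As a sketch of what I mean by using the controls to pre-load (the chunk ids, the neighbour arithmetic and the NavDirection enum are all invented for the example):

    // Sketch only: because the user navigates with the controls, the next
    // chunks are predictable from the current chunk and direction.
    #include <cstdint>
    #include <vector>

    enum class NavDirection { Left, Right, ZoomIn, ZoomOut };

    // Chunks worth loading in the background before the user reaches them.
    std::vector<uint64_t> ChunksToPrefetch(uint64_t currentChunk,
                                           NavDirection dir,
                                           uint32_t lookAhead = 2) {
        std::vector<uint64_t> result;
        for (uint32_t step = 1; step <= lookAhead; ++step) {
            // In the real file the neighbour relation would come from the
            // chunk index; simple +/- arithmetic stands in for it here.
            if (dir == NavDirection::Left)
                result.push_back(currentChunk - step);
            else
                result.push_back(currentChunk + step);
        }
        return result;
    }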

The data has to be loaded progressively from disk based on what is currently in main memory. Records in main memory that are not required in the current context can be removed and, if required, reloaded later.
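
Roughly the kind of resident-set management I am picturing, as a minimal sketch (the memory budget, the Chunk type and the LoadChunkFromDisk stub are placeholders):

    // Chunks are loaded on demand, kept while recently used and evicted
    // (LRU) once a memory budget is exceeded.
    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <memory>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Chunk { std::vector<uint8_t> bytes; };

    class ChunkCache {
    public:
        explicit ChunkCache(std::size_t maxBytes) : maxBytes_(maxBytes) {}

        // Returns the chunk, loading it from disk if it is not resident.
        const Chunk& Get(uint64_t chunkId) {
            auto it = index_.find(chunkId);
            if (it != index_.end()) {                 // hit: mark most recent
                lru_.splice(lru_.begin(), lru_, it->second);
                return *lru_.front().second;
            }
            auto chunk = LoadChunkFromDisk(chunkId);  // miss: read from file
            usedBytes_ += chunk->bytes.size();
            lru_.emplace_front(chunkId, std::move(chunk));
            index_[chunkId] = lru_.begin();
            Evict();
            return *lru_.front().second;
        }

    private:
        void Evict() {  // drop least recently used chunks, keep the newest one
            while (usedBytes_ > maxBytes_ && lru_.size() > 1) {
                auto& victim = lru_.back();
                usedBytes_ -= victim.second->bytes.size();
                index_.erase(victim.first);
                lru_.pop_back();
            }
        }

        std::unique_ptr<Chunk> LoadChunkFromDisk(uint64_t /*chunkId*/) {
            // Stub: the real version seeks to the chunk's offset and reads
            // recordCount * 48 bytes into the buffer.
            return std::make_unique<Chunk>();
        }

        std::size_t maxBytes_;
        std::size_t usedBytes_ = 0;
        std::list<std::pair<uint64_t, std::unique_ptr<Chunk>>> lru_;
        std::unordered_map<uint64_t, decltype(lru_)::iterator> index_;
    };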

  1. How do I access data on disk at high speed, based on some hash number? (A sketch of what I mean follows this list.)

  2. How do I manage main memory if the data to be viewed in the current context is too large? If your answer is level of detail, then how do I build it for a large data set, and should this data be part of the file?
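
For question 1, a minimal sketch of what I mean by hash-based access: an in-memory index from hash to file offset and size, then one seek and one read per chunk. The index layout is an assumption; only standard C file I/O is used:

    #include <cstdint>
    #include <cstdio>
    #include <stdexcept>
    #include <unordered_map>
    #include <vector>

    class ChunkFile {
    public:
        explicit ChunkFile(const char* path) : file_(std::fopen(path, "rb")) {
            if (!file_) throw std::runtime_error("cannot open data file");
        }
        ~ChunkFile() { if (file_) std::fclose(file_); }

        // Built once at startup, or read from an index section of the file.
        void AddIndexEntry(uint64_t hash, uint64_t offset, uint32_t sizeBytes) {
            index_[hash] = Entry{offset, sizeBytes};
        }

        // Random access by hash: one seek + one read.
        std::vector<uint8_t> ReadChunk(uint64_t hash) {
            const auto it = index_.find(hash);
            if (it == index_.end()) throw std::runtime_error("unknown hash");
            std::vector<uint8_t> buffer(it->second.size);
    #if defined(_WIN32)
            _fseeki64(file_, static_cast<long long>(it->second.offset), SEEK_SET);
    #else
            fseeko(file_, static_cast<off_t>(it->second.offset), SEEK_SET);
    #endif
            if (std::fread(buffer.data(), 1, buffer.size(), file_) != buffer.size())
                throw std::runtime_error("short read");
            return buffer;
        }

    private:
        struct Entry { uint64_t offset; uint32_t size; };
        std::FILE* file_;
        std::unordered_map<uint64_t, Entry> index_;
    };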

I have been working on this for the last two weeks and I keep getting stuck on I/O speed.

I am working in native C++ and I cannot use code licensed under the GPL. If you need any more info, let me know.

Ram

  • Further, I am free to change the file format and organise the data to suit my needs. I use OpenGL to view the data. – Ram Feb 10 '12 at 07:59
  • Have you considered porting this data to a database? Any decent database would be able to solve all of these problems. PostgreSQL even has graphing-related extensions that you may find helpful. – Swiss Feb 10 '12 at 08:11

2 Answers


Under most modern operating systems (Linux, other Unixes, Windows) you can map a file into memory.

This means you can access the content of the file as if it were entirely in memory (e.g. you can use data[i++], strchr(data, ...), etc.), and it is the operating system that does the mapping between memory and the file. When you want to read data that is not already in memory, the OS fetches it from the file. You should read this question's answer: Mmap() an entire large file
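
A minimal POSIX sketch of this (on Windows the equivalent is CreateFile / CreateFileMapping / MapViewOfFile; the file name here is just a placeholder):

    #include <cstddef>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        const int fd = open("records.dat", O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st{};
        if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

        // Map the whole file read-only; pages are read from disk only when
        // the corresponding addresses are actually touched.
        void* base = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                          PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

        // Optional hint: tell the OS we will sweep through the data.
        madvise(base, static_cast<std::size_t>(st.st_size), MADV_SEQUENTIAL);

        const unsigned char* data = static_cast<const unsigned char*>(base);
        std::printf("first byte: %u, size: %lld\n",
                    data[0], static_cast<long long>(st.st_size));

        munmap(base, static_cast<std::size_t>(st.st_size));
        close(fd);
        return 0;
    }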

huelbois

I think you are looking for an organization similar to what is used to store level geometry in games, except that (depending on how your program works and what data you need to show) you may need just one dimension. See Quadtree and similar methods (bottom of that article).
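
A rough sketch of the structure (the point type and the bucket capacity are arbitrary, not tied to your record format):

    // Each node covers a square region and splits into four children once
    // it holds too many points; queries only visit nodes overlapping the view.
    #include <array>
    #include <cstddef>
    #include <memory>
    #include <vector>

    struct Point { float x, y; };

    class QuadTree {
    public:
        QuadTree(float cx, float cy, float halfSize, std::size_t capacity = 64)
            : cx_(cx), cy_(cy), half_(halfSize), capacity_(capacity) {}

        void Insert(const Point& p) {
            if (!children_[0]) {                      // still a leaf
                points_.push_back(p);
                if (points_.size() > capacity_) Split();
                return;
            }
            ChildFor(p).Insert(p);
        }

        // Collect points inside the query rectangle (the current view).
        void Query(float minX, float minY, float maxX, float maxY,
                   std::vector<Point>& out) const {
            if (cx_ + half_ < minX || cx_ - half_ > maxX ||
                cy_ + half_ < minY || cy_ - half_ > maxY)
                return;                               // node misses the view
            for (const Point& p : points_)
                if (p.x >= minX && p.x <= maxX && p.y >= minY && p.y <= maxY)
                    out.push_back(p);
            if (children_[0])
                for (const auto& c : children_) c->Query(minX, minY, maxX, maxY, out);
        }

    private:
        void Split() {
            const float h = half_ * 0.5f;
            children_[0] = std::make_unique<QuadTree>(cx_ - h, cy_ - h, h, capacity_);
            children_[1] = std::make_unique<QuadTree>(cx_ + h, cy_ - h, h, capacity_);
            children_[2] = std::make_unique<QuadTree>(cx_ - h, cy_ + h, h, capacity_);
            children_[3] = std::make_unique<QuadTree>(cx_ + h, cy_ + h, h, capacity_);
            for (const Point& p : points_) ChildFor(p).Insert(p);
            points_.clear();
        }

        QuadTree& ChildFor(const Point& p) {
            const std::size_t i = (p.x >= cx_ ? 1 : 0) + (p.y >= cy_ ? 2 : 0);
            return *children_[i];
        }

        float cx_, cy_, half_;
        std::size_t capacity_;
        std::vector<Point> points_;
        std::array<std::unique_ptr<QuadTree>, 4> children_;
    };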

dbrank0