I have a .NET application written in C# (.NET 4.0). In this application, we have to read a large dataset from a file and display the contents in a grid-like structure, so I placed a DataGridView on the form. It has 3 columns, and all column data comes from the file. Initially, the file had about 600,000 records, corresponding to 600,000 rows in the DataGridView.
I quickly found out that the DataGridView collapses under such a large data set, so I had to switch to Virtual Mode. To accomplish this, I first read the file completely into 3 different arrays (corresponding to the 3 columns), and then, as the CellValueNeeded events fire, I supply the correct values from the arrays.
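To show what I mean, here is a stripped-down sketch of that virtual-mode wiring; the three string arrays and the form plumbing are simplified stand-ins for our real code:

```csharp
using System;
using System.Windows.Forms;

public class GridForm : Form
{
    private readonly DataGridView grid = new DataGridView();
    private readonly string[] col1, col2, col3;   // one array per column, filled from the file

    public GridForm(string[] c1, string[] c2, string[] c3)
    {
        col1 = c1; col2 = c2; col3 = c3;

        grid.Dock = DockStyle.Fill;
        grid.VirtualMode = true;                  // the grid no longer stores cell values itself
        grid.ColumnCount = 3;
        grid.CellValueNeeded += OnCellValueNeeded;
        grid.RowCount = col1.Length;              // one grid row per record
        Controls.Add(grid);
    }

    // Called by the grid whenever a visible cell needs to be painted.
    private void OnCellValueNeeded(object sender, DataGridViewCellValueEventArgs e)
    {
        switch (e.ColumnIndex)
        {
            case 0: e.Value = col1[e.RowIndex]; break;
            case 1: e.Value = col2[e.RowIndex]; break;
            case 2: e.Value = col3[e.RowIndex]; break;
        }
    }
}
```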
However, as we quickly found out, there can be a huge (HUGE!) number of records in this file. When the record count is very large, reading all the data into arrays, a List<>, etc. is simply not feasible: we run into memory allocation errors (OutOfMemoryException).
We got stuck there, but then realized: why read all the data into arrays first? Why not read the file on demand as the CellValueNeeded event fires? So that's what we do now: we open the file but do not read anything up front, and as CellValueNeeded events fire, we first Seek() to the correct position in the file and then read the corresponding data.
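Roughly, the handler now looks like the sketch below. It assumes fixed-length records (that is how we can Seek() straight to a row); the record and field sizes are placeholders for our actual layout, and for variable-length lines an offset index built in one initial pass would be needed instead:

```csharp
using System;
using System.IO;
using System.Text;
using System.Windows.Forms;

public class OnDemandSource : IDisposable
{
    private const int RecordSize = 48;            // assumed: 3 fixed-width fields per record
    private const int FieldSize = RecordSize / 3;
    private readonly FileStream stream;
    private readonly byte[] buffer = new byte[RecordSize];

    public OnDemandSource(string path)
    {
        // Open once and keep the handle for the lifetime of the grid.
        stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
    }

    public long RowCount { get { return stream.Length / RecordSize; } }

    // Wire this method to DataGridView.CellValueNeeded.
    public void CellValueNeeded(object sender, DataGridViewCellValueEventArgs e)
    {
        // Jump to the record, read it, and slice out the requested field.
        stream.Seek((long)e.RowIndex * RecordSize, SeekOrigin.Begin);
        stream.Read(buffer, 0, RecordSize);       // a full read is assumed in this sketch
        e.Value = Encoding.ASCII.GetString(buffer, e.ColumnIndex * FieldSize, FieldSize).TrimEnd();
    }

    public void Dispose() { stream.Dispose(); }
}
```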
This is the best we could come up with, but, first of all, it is quite slow, which makes the application sluggish and not user friendly. Second, we can't help but think that there must be a better way to accomplish this. For example, some binary editors (like HXD) are blindingly fast for any file size, so I'd like to know how this can be achieved.
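Our guess is that such editors never hold the whole file in memory; they read it in large pages and keep the most recently used page cached, so a repaint touches the disk at most once per page rather than once per cell. Here is a minimal sketch of that idea; the page size and the single-page cache are our assumptions, not HXD's actual design:

```csharp
using System;
using System.IO;

public class PagedReader : IDisposable
{
    private const int PageSize = 64 * 1024;       // assumed page size
    private readonly FileStream stream;
    private readonly byte[] page = new byte[PageSize];
    private long pageStart = -1;                  // file offset of the cached page
    private int pageLength;

    public PagedReader(string path)
    {
        stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
    }

    public byte ReadByte(long offset)
    {
        long start = (offset / PageSize) * PageSize;
        if (start != pageStart)                   // cache miss: load the whole page
        {
            stream.Seek(start, SeekOrigin.Begin);
            pageLength = stream.Read(page, 0, PageSize);
            pageStart = start;
        }
        int index = (int)(offset - pageStart);
        if (index >= pageLength) throw new ArgumentOutOfRangeException("offset");
        return page[index];
    }

    public void Dispose() { stream.Dispose(); }
}
```

Consecutive reads within the same 64 KB page then cost one array lookup each, which would explain the smooth scrolling, but we don't know whether this is what editors like HXD actually do.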
Oh, and to add to our problems: in the DataGridView's virtual mode, when we set RowCount to the number of rows available in the file (say 16,000,000), it takes a while for the DataGridView even to initialize itself. Any comments on this 'problem' would be appreciated as well.
Thanks