
So I have a 1 GB file in a CSV format like the one below, which I converted to a SQLite3 database:

column1;column2;column3
1212;abcd;20090909
1543;efgh;20120120

Except that I have 12 columns. Now, I need to read, sort, and reformat this data for output, but when I try to do this I run out of RAM (using vectors). I read it in from SQLite and store each line of the file in a struct, which is then pushed onto a deque. Like I said, I run out of memory when RAM usage approaches 2 GB and the app crashes. I tried using STXXL, but apparently it does not support vectors of non-POD types (so the elements have to be long int, double, char, etc.), and my rows consist mostly of std::strings, some boost::dates and one double value.
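To give an idea of what I'm storing, each row ends up in something like this (field names are placeholders; the real struct has 12 members):

```cpp
#include <deque>
#include <string>
#include <boost/date_time/gregorian/gregorian.hpp>

// Roughly what I store per row (names are made up).
struct Row {
    std::string            text1;   // most columns are strings like these
    std::string            text2;
    boost::gregorian::date when;    // a few boost dates
    double                 amount;  // the single double value
    // ... 12 columns in total
};

std::deque<Row> rows;  // one entry per line read from SQLite
```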

Basically, what I need to do is group together all "rows" that have the same value in a specific column; in other words, I need to sort the data based on one column and then work with it.

Any suggestions on how I can read in everything, or at least sort it? I would do it with SQLite3, but that seems time consuming. Perhaps I'm wrong.

Thanks.

user3195734

5 Answers


In order of desirability:

  1. don't use C++ at all, just use the command-line sort utility if possible
  2. if you're wedded to using a DB to process a not-very-large csv file in what sounds like a not-really-relational way, shift all the heavy lifting into the DB and let it worry about memory management.
  3. if you must do it in C++:
    • skip the SQLite3 step entirely, since you're not using it for anything. Just map the csv file into memory and build a vector of row pointers, then sort those pointers without moving the data around (see the sketch after this list)
    • if you must parse the rows into structures:
      • don't store the string columns as std::string - this requires an extra non-contiguous allocation, which will waste memory. Prefer an inline char array if the length is bounded
      • choose the smallest integer type that will fit your values (e.g., uint16_t would fit your sample first-column values)
      • be careful about padding: check the sizeof your struct, and reorder members or pack it if it's much larger than expected
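Here's a rough sketch of the mmap-and-pointer-sort idea (POSIX calls, error handling omitted, and the ';'-field extraction is only illustrative):

```cpp
#include <algorithm>
#include <cstring>
#include <string_view>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Return the nth ';'-separated field of the line starting at `line`.
static std::string_view field(const char* line, const char* file_end, int n) {
    const char* nl = static_cast<const char*>(memchr(line, '\n', file_end - line));
    const char* line_end = nl ? nl : file_end;
    const char* p = line;
    for (int i = 0; i < n && p < line_end; ++i) {
        const char* semi = static_cast<const char*>(memchr(p, ';', line_end - p));
        if (!semi) return {};
        p = semi + 1;
    }
    const char* semi = static_cast<const char*>(memchr(p, ';', line_end - p));
    return std::string_view(p, (semi ? semi : line_end) - p);
}

int main() {
    int fd = open("data.csv", O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    const char* end = data + st.st_size;

    // Collect a pointer to the start of every line.
    std::vector<const char*> lines;
    for (const char* p = data; p < end; ) {
        lines.push_back(p);
        const char* nl = static_cast<const char*>(memchr(p, '\n', end - p));
        p = nl ? nl + 1 : end;
    }

    // Sort the pointers by (say) the third column; the row data never moves.
    // This compares lexicographically - parse to an integer if you need
    // numeric ordering.
    std::sort(lines.begin(), lines.end(), [&](const char* a, const char* b) {
        return field(a, end, 2) < field(b, end, 2);
    });

    // ... iterate `lines` in sorted order and do the grouping/output here ...

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}
```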
Useless

If you want to stick with the SQLite3 approach, I recommend using a list instead of a vector, so your program doesn't need to find 1 GB or more of contiguous memory.

If you can skip the SQLite3 step, here is how I would solve the problem:

  1. Write a class (e.g. MyRow) which has a field for every column in your data set.
  2. Read the file into a std::list<MyRow>, where every row in your data set becomes an instance of MyRow.
  3. Write a predicate which compares the desired column.
  4. Use the sort member function of std::list to sort your data (see the sketch below).
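Something along these lines (MyRow, the column names and the parsing are only a sketch):

```cpp
#include <fstream>
#include <list>
#include <sstream>
#include <string>

// Placeholder row type: one field per column of the data set.
struct MyRow {
    std::string column1;
    std::string column2;
    std::string column3;
    // ... remaining columns ...
};

int main() {
    std::list<MyRow> rows;

    // Read the ';'-separated file, one MyRow per line.
    std::ifstream in("data.csv");
    for (std::string line; std::getline(in, line); ) {
        std::istringstream ss(line);
        MyRow r;
        std::getline(ss, r.column1, ';');
        std::getline(ss, r.column2, ';');
        std::getline(ss, r.column3, ';');
        rows.push_back(std::move(r));
    }

    // Sort by the desired column; std::list::sort takes the predicate directly.
    rows.sort([](const MyRow& a, const MyRow& b) {
        return a.column2 < b.column2;
    });
}
```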

I hope this helps you.

Tom
  • I'd consider a [segmented_array](http://i42.co.uk/stuff/segmented_array.htm) or a std::deque before a list. – wood_brian Mar 20 '14 at 22:01

There is significant overhead for std::string. If your struct contains a std::string for each column, you will waste a lot of space on char * pointers, malloc headers, etc.

Try parsing all the numerical fields immediately when reading the file, and storing them in your struct as ints or whatever you need.

If your file actually contains a lot of numeric fields like your example shows, I would expect it to use less than the file size worth of memory after parsing.
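For instance, with the sample layout from the question (a numeric column, a short text column and a date-like column), a fully parsed row could look something like this (sizes and names are illustrative):

```cpp
#include <cstdint>

// One row, parsed at read time: about 16 bytes, instead of several
// heap-allocated std::strings (each costing a pointer, capacity bookkeeping
// and a separate allocation with its own malloc header).
struct Row {
    uint32_t column1;    // e.g. 1212, parsed from text
    char     column2[8]; // short, bounded-length string stored inline
    uint32_t column3;    // e.g. 20090909, kept as a yyyymmdd integer
};

static_assert(sizeof(Row) <= 16, "check padding if this fires");
```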

japreiss

Create a structure for your records.

The record should have "ordering" functions for the fields you need to sort by.

Read the file as objects and store them in a container that has random-access capability, such as std::vector or std::deque.

For each field you want to sort by: Create an index table, std::map, using the field value as the key and the record's index as the value.

To process the records in one field's order, pick the corresponding index table and iterate through it, using the mapped value (the index) to fetch the object from the container of objects.

If the records are of fixed length, or can be converted to a fixed length, you could write the objects in binary to a file and seek to individual records. Use an index table like the one above, except store file positions instead of indices.
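A sketch of the index-table idea, with placeholder field names, using a std::multimap so that duplicate key values (the groups) end up adjacent:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Placeholder record type; add ordering helpers for the sortable fields.
struct Record {
    std::string key_column;
    std::string other;
    double      value;
};

int main() {
    std::vector<Record> records;
    // ... read the file into `records` here ...

    // Index table: field value -> position of the record in `records`.
    std::multimap<std::string, std::size_t> index;
    for (std::size_t i = 0; i < records.size(); ++i)
        index.emplace(records[i].key_column, i);

    // Visit the records in key order without ever moving them; consecutive
    // entries with the same key belong to the same group.
    for (const auto& entry : index) {
        const Record& r = records[entry.second];
        // ... handle r here ...
        (void)r;
    }
}
```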

Thomas Matthews
  • I did this, but my RAM exceeded 2GB. That's why I asked for an alternative approach, but thanks for answering anyway! – user3195734 Mar 21 '14 at 19:32

Thanks for your answers, but I figured out a very fast and simple approach.

I let SQLite3 do the job for me by giving it this query:

SELECT * FROM my_table ORDER BY key_column ASC

For an 800 MB file, that took about 70 seconds to process. My C++ program then received all the data already ordered by the column I wanted it grouped by, so I could process one group at a time and output each group in my desired format as I went, keeping RAM usage well under control. The total time for the operation was about 200 seconds, which I'm pretty happy with.
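In case it's useful, the streaming part looks roughly like this with the sqlite3 C API (the database filename is a placeholder, the table/column names are from my example above, and error handling is trimmed):

```cpp
#include <sqlite3.h>
#include <string>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("mydata.db", &db);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT * FROM my_table ORDER BY key_column ASC", -1, &stmt, nullptr);

    std::string current_group;
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        // Column 0 is assumed to be key_column here.
        const char* key =
            reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));

        if (key && key != current_group) {
            // A new group starts: flush/output the previous one.
            current_group = key;
        }
        // ... accumulate this row into the current group and
        //     write it out in the desired output format ...
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```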

Thank you for your time.

user3195734