
I have a big CSV file (25 MB) that represents a symmetric graph (about 18k x 18k). While parsing it into an array of vectors, I profiled the code (with the VS2012 analyzer), and it shows that the bulk of the parsing time (about 19 seconds total) is spent reading each character (std::getline calling basic_string::operator+=), as shown in the profiler screenshot below:

[profiler screenshot: hot path in getline / basic_string::operator+=]

This leaves me frustrated, as with Java's simple buffered line reading and a tokenizer I achieve the same thing in less than half a second.

My code uses only the standard library:

int allColumns = initFirstRow(file,secondRow);
// secondRow has been initialized with one value
int column = 1; // don't forget, the first column is 0
VertexSet* rows = new VertexSet[allColumns];
rows[1] = secondRow;
string vertexString;
long double vertexDouble;
for (int row = 1; row < allColumns; row++){
    // don't do the last row
    for (; column < allColumns; column++){
        // don't do the last column
        getline(file,vertexString,','); 
        vertexDouble = stold(vertexString);
        if (vertexDouble > _TH){
            rows[row].add(column);
        }
    }
    // do the last column in the row
    getline(file,vertexString);
    vertexDouble = stold(vertexString);
    if (vertexDouble > _TH){
        rows[row].add(++column);
    }
    column = 0;
}
initLastRow(file,rows[allColumns-1],allColumns);

initFirstRow and initLastRow basically do the same thing as the loop above, except that initFirstRow also counts the number of columns.

VertexSet is basically a vector of indexes (int). Each value read (separated by ',') is no more than 7 characters long (the values lie between -1 and 1).
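(VertexSet itself isn't shown; for reference, a minimal class consistent with the usage above might look like the following sketch, which is an assumption rather than the actual code.)

#include <vector>

// Hypothetical sketch of VertexSet as described in the question: a thin
// wrapper around a vector of column indexes (int). Not the actual class.
class VertexSet {
    std::vector<int> indexes;
public:
    void add(int column) { indexes.push_back(column); }
    const std::vector<int>& get() const { return indexes; }
};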

squeezy
  • +1 for asking an optimisation/performance question based on actual profiling results, rather than speculation! – Oliver Charlesworth Dec 28 '13 at 14:03
  • 2
    What optimization options are you using? (none => your main problem). – Dietmar Kühl Dec 28 '13 at 14:19
  • 3
    Try using the `std::string::reserve` function to reduce the amount of reallocations. – user123 Dec 28 '13 at 14:21
  • @DietmarKühl Can you clarify which options I could use? – squeezy Dec 28 '13 at 14:27
  • 1
    @fatsokol: I'm a UNIX guy. I would use `-O2` or `-O3`. You seem to work with Visual Studio and I'm blissfully ignorant about Visual Studio. You might want to try a "release" build. However, compiling code using IOStreams without optimization is typically slow (in general, C++ code without optimization tends to be slower). – Dietmar Kühl Dec 28 '13 at 14:35
  • 1
    Reading each line into an intermediate string may not be necessary. – Lightness Races in Orbit Dec 28 '13 at 14:55
  • Looks like reallocation overkill ... you are probably adding (allocate+copy+delete) data to some table or string on each line or CSV value. Try initializing the size of the destination tables to a safe value (equal to or bigger than needed). Another option is to double the size whenever the current capacity is reached ... to avoid too many reallocations. – Spektre Dec 28 '13 at 15:42
  • @fatsokol: Use /Ox or choose a Release build. Just wondering - why is column initialized to 1 and then set to 0? You can also read the double directly from the file rather than via an intermediate string (see the sketch after these comments). Also, as others have pointed out, using reserve to avoid reallocations on all your vectors/strings will help. Let us know what the number is as-is with optimization on... – Guy Sirton Dec 28 '13 at 19:12
  • Did enabling optimization change anything? – Dietmar Kühl Dec 28 '13 at 19:27
  • 5
    @DietmarKühl I've made "release" instead of "debug", and the running time is now reduced from 18+ seconds to 1+ seconds. – squeezy Dec 29 '13 at 06:56
  • @GuySirton The iteration starts with column = 1 because I've already read the first value of the second row in `initFirstRow`. Adding reserve did not help much. Can you specify (a link to a reference, or in a comment) how I may read doubles straight from a file? – squeezy Dec 29 '13 at 07:00
  • What function is calling `std::string::operator+=`? I don't see `vertexString` being appended anywhere in the code you posted. – Terrance Kennedy Dec 31 '13 at 20:27
  • 1
    As Mohammad said, the problem is most likely reallocation of the string. When you append to string it might need to realloc). Anyways, I countered similar problems where std::string::reserve wasn't enough. But when I used std::string::resize I got the real boost. Also try to tokenize first by '\n' and only then by ','. This will cause you to read bigger portions at once. – buc030 Jan 04 '14 at 11:27
  • One minor point: I think you mean 1.8k x 1.8k. 18k x 18k x ~8 characters per item would give around 2.4 gigabytes of data (in which case, reading the data in 18 seconds would be *quite* respectable). – Jerry Coffin Jan 07 '14 at 20:26
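To illustrate Guy Sirton's suggestion from the comments: the doubles can be extracted directly from the stream with operator>>, skipping the separator afterwards, instead of going through an intermediate string. A minimal sketch (the function signature is invented; VertexSet and _TH are the question's class and threshold):

#include <fstream>

// Sketch: operator>> parses the long double in place and stops at the
// ',' (commas are not whitespace); ignore() then consumes the separator.
void readRow(std::ifstream& file, VertexSet& row, int allColumns, long double _TH) {
    long double vertexDouble;
    for (int column = 0; column < allColumns; ++column) {
        file >> vertexDouble;  // no temporary string, no stold
        file.ignore();         // skip the ',' (or the trailing '\n')
        if (vertexDouble > _TH)
            row.add(column);
    }
}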

5 Answers


At 25 megabytes, I'm going to guess that your file is machine generated. As such, you (probably) don't need to worry about things like verifying the format (e.g., that every comma is in place).

Given the shape of the file (i.e., each line is quite long) you probably won't impose a lot of overhead by putting each line into a stringstream to parse out the numbers.

Based on those two facts, I'd at least consider writing a ctype facet that treats commas as whitespace, then imbuing the stringstream with a locale using that facet to make it easy to parse out the numbers. Overall code length would be a little greater, but each part of the code would end up pretty simple:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <time.h>
#include <stdlib.h>
#include <locale>
#include <sstream>
#include <algorithm>
#include <iterator>

// Note: base classes are initialized before members, so the table cannot
// be a plain member of my_ctype -- std::ctype<char> would receive a
// pointer into a vector that hadn't been constructed yet. Holding the
// table in a private base listed first fixes the initialization order.
struct table_holder {
    std::vector<std::ctype<char>::mask> my_table;
    table_holder() : my_table(std::ctype<char>::table_size) {}
};

class my_ctype : private table_holder, public std::ctype<char> {
public:
    my_ctype(size_t refs=0):
        std::ctype<char>(my_table.data(), false, refs)
    {
        std::copy_n(classic_table(), table_size, my_table.data());
        my_table[',']=(mask)space;
    }
};

template <class T>
class converter {
    std::stringstream buffer;
    my_ctype *m;
    std::locale l;
public:
    converter() : m(new my_ctype), l(std::locale::classic(), m) { buffer.imbue(l); }

    std::vector<T> operator()(std::string const &in) {
        buffer.clear();   // reset the eof/fail bits set by the last extraction
        buffer.str(in);   // replace the contents rather than appending to them
        return std::vector<T> {std::istream_iterator<T>(buffer),
            std::istream_iterator<T>()};        
    }
};

int main() {
    std::ifstream in("somefile.csv");
    std::vector<std::vector<double>> numbers;

    std::string line;
    converter<double> cvt;

    clock_t start=clock();
    while (std::getline(in, line))
        numbers.push_back(cvt(line));
    clock_t stop=clock();
    std::cout<<double(stop-start)/CLOCKS_PER_SEC << " seconds\n";
}

To test this, I generated a 1.8K x 1.8K CSV file of pseudo-random doubles like this:

#include <iostream>
#include <stdlib.h>

int main() {
    for (int i=0; i<1800; i++) {
        for (int j=0; j<1800; j++)
            std::cout<<rand()/double(RAND_MAX)<<",";
        std::cout << "\n";
    }
}

This produced a file around 27 megabytes. After compiling the reading/parsing code with gcc (g++ -O2 trash9.cpp), a quick test on my laptop showed it running in about 0.18 to 0.19 seconds. It never seems to use (even close to) all of one CPU core, indicating that it's I/O bound, so on a desktop/server machine (with a faster hard drive) I'd expect it to run faster still.

Jerry Coffin

The inefficiency here is in Microsoft's implementation of std::getline, which is being used in two places in the code. The key problems with it are:

  1. It reads from the stream one character at a time
  2. It appends to the string one character at a time

The profile in the original post shows that the second of these problems is the biggest issue in this case.

I wrote more about the inefficiency of std::getline here.

GNU's implementation of std::getline, i.e. the version in libstdc++, is much better.

Sadly, if you want your program to be fast and you build it with Visual C++, you'll have to use lower-level functions than std::getline.
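As a rough illustration of what "lower level" can mean here (my sketch, not part of the original answer): read the whole file with one fread call and parse the buffer in place with strtod, so the per-character std::getline path is never involved:

#include <cstdio>
#include <cstdlib>
#include <vector>

// Sketch: slurp the file with a single fread, then walk the buffer with
// strtod. Error handling is minimal, and the (machine-generated) format
// is assumed to be valid.
std::vector<double> parse_file(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    std::vector<char> buf(size + 1);
    std::fread(buf.data(), 1, size, f);
    std::fclose(f);
    buf[size] = '\0';                            // strtod needs a terminator

    std::vector<double> values;
    const char* p = buf.data();
    while (*p) {
        char* end;
        values.push_back(std::strtod(p, &end));  // parse one number
        p = end;
        while (*p == ',' || *p == '\n' || *p == '\r')
            ++p;                                 // skip the separators
    }
    return values;
}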

dmr195

The debug runtime library in Visual Studio is very slow because it performs a lot of debug checks (for out-of-bounds accesses and the like) and calls lots of very small functions that are not inlined when you compile in Debug.

Running your program in Release should remove all these overheads.

My bet on the next bottleneck is string allocation.
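If string allocation is indeed the next bottleneck, one common mitigation is to reserve capacity up front, so the repeated getline and push_back calls stop reallocating. A minimal sketch against the question's code (the sizes are assumptions):

#include <string>
#include <vector>

// Sketch: reserve once, before the parsing loops. getline reuses the
// string's existing capacity, and push_back stops growing the vector.
void reserveBuffers(std::string& vertexString, std::vector<int>& indexes,
                    int allColumns) {
    vertexString.reserve(16);     // each value is at most ~7 characters
    indexes.reserve(allColumns);  // worst case: every column exceeds _TH
}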

Jerem

I would try reading bigger chunks of memory at once and then parsing them all: for example, read a full line, then parse that line using pointers and specialized functions, along the lines of the sketch below.
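A minimal sketch of that idea (my illustration, not code from this answer): read one full line at a time, then walk it with a pointer using strtod:

#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

// Sketch: read whole lines, then parse each line in place with strtod
// instead of extracting the stream character by character.
void parse_lines(std::ifstream& in, std::vector<std::vector<double>>& rows) {
    std::string line;
    while (std::getline(in, line)) {
        std::vector<double> row;
        const char* p = line.c_str();
        while (*p) {
            char* end;
            double v = std::strtod(p, &end);
            if (end == p) break;                // nothing left to parse
            row.push_back(v);
            p = (*end == ',') ? end + 1 : end;  // step past the comma
        }
        rows.push_back(std::move(row));
    }
}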

antoshka
  • In a sense that's what should be happening when you use the library functions. I agree you can probably optimize further with this approach but a solid standard library based solution (with optimizations on) should be quite performant... – Guy Sirton Dec 29 '13 at 02:24

Good answers here. It took me a while, but I had the same problem. After this fix, my write-and-process time went from 38 seconds to 6 seconds. Here's what I did.

First, get the data using Boost's memory-mapped files. Then you can use Boost threads to speed up processing of the const char* that the memory map returns. Something like this (the multithreading differs depending on your implementation, so I've excluded that part):

#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/thread/thread.hpp>   // for the multithreaded version (not shown)
#include <boost/lockfree/queue.hpp>  // likewise

#include <cstdlib>
#include <string>
#include <vector>

void foo(const std::string& path)
{
    boost::iostreams::mapped_file mmap(path, boost::iostreams::mapped_file::readonly);
    const char* chars = mmap.const_data();    // pointer into the mapped file
    const char* eofile = chars + mmap.size(); // used to detect end of file
    std::string next;                         // accumulates the current token
    std::vector<double> data;                 // stores the parsed values
    for (; chars && chars != eofile; chars++) {
        if (*chars == ',' || *chars == '\n') {       // end of value
            data.push_back(std::atof(next.c_str())); // convert and store
            next.clear();
        }
        else
            next += *chars;                          // append to current token
    }
}