12

I need to read a huge 35 GB file from disk line by line in C++. Currently I do it the following way:

ifstream infile("myfile.txt");
string line;
while (true) {
    if (!getline(infile, line)) break;
    long linepos = infile.tellg();
    process(line,linepos);
}

But it gives me only about 2 MB/s, even though the file manager copies the same file at 100 MB/s. I guess that getline() is not doing buffering correctly. Please propose some kind of buffered line-by-line reading approach.

UPD: process() is not the bottleneck; the code runs at the same speed without it.

Stepan Yakovenko

3 Answers

17

You won't get anywhere close to line speed with the standard IO streams. Buffering or not, pretty much ANY parsing will kill your speed by orders of magnitude. I did experiments on datafiles composed of two ints and a double per line (Ivy Bridge chip, SSD):

  • IO streams in various combinations: ~10 MB/s. Pure stream parsing (f >> i1 >> i2 >> d) is faster than a getline into a string followed by a stringstream parse.
  • C file operations like fscanf get about 40 MB/s.
  • getline with no parsing: 180 MB/s.
  • fread: 500-800 MB/s (depending on whether or not the file was cached by the OS).

I/O is not the bottleneck; parsing is. In other words, your `process` is likely your slow point.
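
For reference, the fread number at the bottom of the list comes from a bulk-read loop like the following. This is only a minimal sketch, not the exact benchmark code: the 1 MB buffer size is an arbitrary choice, and a real program would hand each filled buffer to its parser instead of just counting bytes.

#include <cstdio>
#include <vector>

// Pull the whole file through fread in large blocks and count the bytes.
size_t readWithFread(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return 0;
    std::vector<char> buf(1 << 20);                      // 1 MB per read
    size_t total = 0;
    while (size_t n = std::fread(buf.data(), 1, buf.size(), f)) {
        total += n;                                      // parse buf[0..n) here
    }
    std::fclose(f);
    return total;
}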

So I wrote a parallel parser. It's composed of tasks (using a TBB pipeline):

  1. fread large chunks (one such task at a time)
  2. re-arrange chunks such that a line is not split between chunks (one such task at a time)
  3. parse chunk (many such tasks)

I can have unlimited parsing tasks because my data is unordered anyway. If yours isn't, then this might not be worth it to you. This approach gets me about 100 MB/s on a 4-core Ivy Bridge chip.
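
The actual pipeline code isn't included here, but a minimal sketch along these lines conveys the idea (assuming the oneTBB parallel_pipeline API; the Chunk struct, the 1 MB block size and the parseChunk callback are placeholders, and steps 1 and 2 above are folded into a single serial input filter):

#include <tbb/parallel_pipeline.h>
#include <algorithm>
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// A chunk of complete lines: data always ends on a '\n' boundary
// (except possibly the very last chunk of the file).
struct Chunk {
    std::vector<char> data;
    long long startOffset = 0;   // file offset of data[0]
};

void readFileParallel(const char* path, void (*parseChunk)(const Chunk&)) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return;

    const size_t BLOCK = 1 << 20;   // 1 MB reads; tune as needed
    std::string carry;              // unfinished last line of the previous block
    long long offset = 0;           // file offset of the next chunk to emit

    tbb::parallel_pipeline(
        8,   // max chunks in flight
        // Stages 1+2 (serial): fread a block, then trim it back to the last
        // newline so that no line is split between chunks.
        tbb::make_filter<void, std::shared_ptr<Chunk>>(
            tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> std::shared_ptr<Chunk> {
                auto c = std::make_shared<Chunk>();
                c->startOffset = offset;
                c->data.assign(carry.begin(), carry.end());
                c->data.resize(carry.size() + BLOCK);
                size_t n = std::fread(c->data.data() + carry.size(), 1, BLOCK, f);
                c->data.resize(carry.size() + n);
                if (c->data.empty()) { fc.stop(); return nullptr; }
                if (n != 0) {
                    auto lastNl = std::find(c->data.rbegin(), c->data.rend(), '\n');
                    if (lastNl == c->data.rend()) {           // no newline at all
                        carry.assign(c->data.begin(), c->data.end());
                        c->data.clear();                      // nothing to emit yet
                    } else {
                        size_t cut = c->data.rend() - lastNl; // one past the '\n'
                        carry.assign(c->data.begin() + cut, c->data.end());
                        c->data.resize(cut);
                    }
                } else {
                    carry.clear();   // final piece of the file, no trailing '\n'
                }
                offset += (long long)c->data.size();
                return c;
            }) &
        // Stage 3 (parallel): parse complete lines; many of these run at once.
        tbb::make_filter<std::shared_ptr<Chunk>, void>(
            tbb::filter_mode::parallel,
            [parseChunk](std::shared_ptr<Chunk> c) {
                if (!c->data.empty()) parseChunk(*c);
            }));

    std::fclose(f);
}

Only the input filter is serial; because the data is unordered, any number of instances of the parsing filter can run concurrently, which is what brings the throughput back up to around 100 MB/s.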

Adam
  • `ifstream.read` should have the same performance as `fread` (it does no parsing either); using it probably requires fewer changes to existing code – anatolyg Jul 20 '14 at 15:24
  • Was the file cached for your `getline` test? I see 680-800MB/sec for the asker's code (with an empty `process()`) and 1GB/sec without the tellg. (gcc-4.6.3, `-O0`) – user2313838 Jul 20 '14 at 15:48
  • @user2313838 yes it was cached. My code also looks a lot like the asker's. – Adam Jul 20 '14 at 19:14
  • MacOS/SSD/CoreI7(2xPhysical): `ifstream iff("out.mp"); while(iff.good()) { getline(iff, line); // get line from file }` 16727 milliseconds, file size: 371M – Arthur Kushman Aug 02 '17 at 10:07
  • @ArthurKushman that's only 22 MB/s, you should be able to get much more than that. – Adam Aug 02 '17 at 18:06
  • What does this mean: *"You won't get anywhere close to line speed with the standard IO streams"*? Even if we remove the `process` in OP's original code, it would still be slow. – starriet Aug 27 '22 at 04:44
  • @starriet it means exactly what you say in your second sentence. Line speed is what the storage device is capable of reading at, in this example around 500 MB/s. IO streams can only parse at around 10 MB/s, hence the parsing is the bottleneck (by a lot). – Adam Sep 08 '22 at 02:38
  • @Adam Thanks for the comment. Then I think this part of your answer *"your `process` is likely your slow point"* should be changed to something like *"`getline` is your slow point"*, no? – starriet Sep 08 '22 at 03:12
  • @starriet no, assuming OP's machine was somewhat similar to my laptop, their getline() loop by itself can achieve around 180 MB/s. The complete code was achieving only 2 MB/s, so getline isn't the main bottleneck. The only piece left is `process`. – Adam Sep 08 '22 at 03:54
  • @Adam Gotcha. (although OP said the speed was the same even after removing the `process` ...I don't know why). So, anyways, if we compare only between I/O and getline, then getline would be the bottleneck. Let me know if I misunderstood something. Thanks for your clarification. – starriet Sep 08 '22 at 04:04
4

I've ported my own buffering code from my Java project, and it does what I need. I had to add #ifdef workarounds for the MSVC 2010 compiler's tellg(), which always returns wrong (negative) values on huge files. This algorithm gives the desired speed of ~100 MB/s, although it does some useless new[] allocations.

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
using namespace std;

#ifndef WIN32
typedef long long __int64;   // the code below uses MSVC's __int64
#endif

// Reads the file in large chunks and calls lineHandler for every
// '\n'-terminated slice together with its absolute position in the file.
void readFileFast(ifstream &file, void(*lineHandler)(char*str, int length, __int64 absPos)){
        int BUF_SIZE = 40000;
        file.seekg(0,ios::end);
        ifstream::pos_type p = file.tellg();
#ifdef WIN32
        // MSVC 2010 workaround: pull the 64-bit offset straight out of the
        // fpos object, since tellg() misbehaves on huge files.
        __int64 fileSize = *(__int64*)(((char*)&p) +8);
#else
        __int64 fileSize = p;
#endif
        file.seekg(0,ios::beg);
        BUF_SIZE = (int)min((__int64)BUF_SIZE, fileSize);
        char* buf = new char[BUF_SIZE];
        int bufLength = BUF_SIZE;
        file.read(buf, bufLength);

        int strEnd = -1;           // index of the '\n' that ended the previous line (-1 = none yet)
        int strStart;              // index of the '\n' that precedes the current line
        __int64 bufPosInFile = 0;  // absolute file offset of buf[0]
        while (bufLength > 0) {
            int i = strEnd + 1;
            strStart = strEnd;
            strEnd = -1;
            for (; i < bufLength && i + bufPosInFile < fileSize; i++) {
                if (buf[i] == '\n') {
                    strEnd = i;
                    break;
                }
            }

            if (strEnd == -1) { // no newline found: scroll the buffer
                if (strStart == -1) { // line is longer than the buffer: flush it as a fragment and refill
                    lineHandler(buf + strStart + 1, bufLength, bufPosInFile + strStart + 1);
                    bufPosInFile += bufLength;
                    bufLength = (int)min((__int64)bufLength, fileSize - bufPosInFile);
                    delete[]buf;
                    buf = new char[bufLength];
                    file.read(buf, bufLength);
                } else { // move the unfinished tail to the front of the buffer and refill the rest
                    int movedLength = bufLength - strStart - 1;
                    memmove(buf,buf+strStart+1,movedLength);
                    bufPosInFile += strStart + 1;
                    int readSize = (int)min((__int64)(bufLength - movedLength), fileSize - bufPosInFile - movedLength);

                    if (readSize != 0)
                        file.read(buf + movedLength, readSize);
                    if (movedLength + readSize < bufLength) {
                        char *tmpbuf = new char[movedLength + readSize];
                        memmove(tmpbuf,buf,movedLength+readSize);
                        delete[]buf;
                        buf = tmpbuf;
                        bufLength = movedLength + readSize;
                    }
                    strEnd = -1;
                }
            } else {
                lineHandler(buf+ strStart + 1, strEnd - strStart, bufPosInFile + strStart + 1);
            }
        }
        lineHandler(0, 0, 0);//eof
}

void lineHandler(char* buf, int l, __int64 pos){
    if(buf==0) return; // eof marker
    string s = string(buf, l);
    printf("%s", s.c_str()); // never pass the line itself as the format string
}

void loadFile(){
    ifstream infile("file");
    readFileFast(infile,lineHandler);
}
Stepan Yakovenko
0

Use a line parser, or write your own. Here is a sample on SourceForge: http://tclap.sourceforge.net/. Add a buffer if necessary.

prs_31117