
I have an extremely huge .csv file (with no headers) and I want to bulk insert it into a SQLite database using C++.

I found this algorithm, which seems to be the fastest one for my needs.

So, I have this piece of code:

void readFileFast(ifstream &file, void(*lineHandler)(char*str, int length, __int64 absPos)){
        int BUF_SIZE = 40000;
        file.seekg(0,ios::end);
        ifstream::pos_type p = file.tellg();
#ifdef WIN32
        __int64 fileSize = *(__int64*)(((char*)&p) +8);
#else
        __int64 fileSize = p;
#endif
        file.seekg(0,ios::beg);
        BUF_SIZE = min(BUF_SIZE, fileSize);
        char* buf = new char[BUF_SIZE];
        int bufLength = BUF_SIZE;
        file.read(buf, bufLength);

        int strEnd = -1;
        int strStart;
        __int64 bufPosInFile = 0;
        while (bufLength > 0) {
            int i = strEnd + 1;
            strStart = strEnd;
            strEnd = -1;
            for (; i < bufLength && i + bufPosInFile < fileSize; i++) {
                if (buf[i] == '\n') {
                    strEnd = i;
                    break;
                }
            }

            if (strEnd == -1) { // scroll buffer
                if (strStart == -1) {
                    lineHandler(buf + strStart + 1, bufLength, bufPosInFile + strStart + 1);
                    bufPosInFile += bufLength;
                    bufLength = min(bufLength, fileSize - bufPosInFile);
                    delete[]buf;
                    buf = new char[bufLength];
                    file.read(buf, bufLength);
                } else {
                    int movedLength = bufLength - strStart - 1;
                    memmove(buf,buf+strStart+1,movedLength);
                    bufPosInFile += strStart + 1;
                    int readSize = min(bufLength - movedLength, fileSize - bufPosInFile - movedLength);

                    if (readSize != 0)
                        file.read(buf + movedLength, readSize);
                    if (movedLength + readSize < bufLength) {
                        char *tmpbuf = new char[movedLength + readSize];
                        memmove(tmpbuf,buf,movedLength+readSize);
                        delete[]buf;
                        buf = tmpbuf;
                        bufLength = movedLength + readSize;
                    }
                    strEnd = -1;
                }
            } else {
                lineHandler(buf+ strStart + 1, strEnd - strStart, bufPosInFile + strStart + 1);
            }
        }
        lineHandler(0, 0, 0);//eof
}

void lineHandler(char*buf, int l, __int64 pos){
    if(buf==0) return;
    string s = string(buf, l);
    fwrite(s.data(), 1, s.size(), stdout); // don't pass CSV data as a printf format string
}

void loadFile(){
    ifstream infile("file");
    readFileFast(infile,lineHandler);
}

I first want to collect, say, 100,000 complete lines (no half line at the end of a chunk) so that I can bulk insert them into my SQLite database file.

But how to retrieve them?

I tried this:

int main() {
    ifstream ifile("./data.txt", std::ifstream::binary);
    if (ifile.good())
    {
        while (true)
        {
            readFileFast(ifile, lineHandler);
            cout<<lineHandler;
            if(!ifile) break;
            cout<<"------------------------------------------"<<endl;

        }
        // close file
        ifile.close();
    }else{
        cout<<"File not found!"<<endl;
    }

    return 0;
}

But it does not work: it prints 1 every time, whereas I want batches of 100,000 complete lines (no half line at the end) so that I can bulk insert them into SQLite.

Thank you in advance!

P.S. I also found this algorithm: https://cplusplus.com/forum/beginner/194071/

But it prints lines, and almost every time the last line is just a half line, whereas I need full, complete lines in order to bulk insert them all at once into the SQLite database.

YoYoYo
  • It would be a neat trick if the code that seems to "print 1 very time" managed this feat by passing a function pointer to std::cout's `<<` overload, as the shown code attempted to do. It's a popular belief that the way to implement anything in C++ is to run a Google search to try to find some code on the intertubes that's vaguely described to do a similar task, then cobble it together with more code, cross your fingers, and hope it works but 9 times out of 10 it doesn't. Perhaps you can try to spend some time digging through the code and figure out why it "the last line is just a half line"? – Sam Varshavchik Jun 15 '22 at 10:56
  • @SamVarshavchik I tried to output lineHandler's content. If it were a vector it would have lineHandler.data(), which could be printed with cout, but it isn't one. How do I get those lines, please? – YoYoYo Jun 15 '22 at 10:59
  • Well, it is not a vector. It is a function pointer. This is because the found code is old, crufty C, and not C++, are you aware of that? Reading that old question, what a hoot! A Java person translated some code from Java into C++, but ended up with C, and rather crappy C. Typical. Anyway: do you understand how `readFileFast` actually works (badly), and how it uses this function pointer parameter? Its purpose should be fairly clear, and if you understand how it works, then the shown approach should be easily adaptable to clean, modern, C++, at least. – Sam Varshavchik Jun 15 '22 at 11:03
  • ... and even if it was a vector of `char`s, there's no guarantee whatsoever that its `data()` member will provide something suitable for `std::cout`, for the obvious reasons that the `const char *` parameter to the specific `<<` overload you're referring to has an additional requirement that a `data()` on some random `std::vector` will not give you (automatically). Sounds like it might be beneficial to invest a little bit of time in some C++ fundamentals, here... – Sam Varshavchik Jun 15 '22 at 11:06
  • How huge is _"extremely huge"_? – Ted Lyngmo Jun 15 '22 at 11:15
  • @TedLyngmo It's a 63 GB CSV file with no headers. – YoYoYo Jun 15 '22 at 13:53
  • @YoYoYo I see. I just realized that I made a comparison of plain `getline`s vs. `mmap` a couple of years ago ([here](https://stackoverflow.com/a/53634081/7582247)). If you have a Posix system, you could try the `mmap` part out to see if that speeds it up. The `Mmap` class in there is just a thin C++ wrapper around `mmap`/`munmap`. – Ted Lyngmo Jun 15 '22 at 15:19

0 Answers