0

I have to log huge amount of data in a CSV file with each row having 5 elements. I have used a large buffer to store the rows and then flush it in one shot using fwrite(...) when it gets filled and repeat till needed. The following is a snippet of logging function:

void logInFile(int a, int b, int c, int d, int e)
{    
    sprintf(rowInLog,"%d,%d,%d,%d,%d\n",a,b,c,d,e); 
    int bytesInRow = strlen(rowInLog);
    if(bytesInRow + bytesUsedInBuffer <= sizeOfBuffer)
    {
        strcat(buffer, rowInLog);
        bytesUsedInBuffer += bytesInRow;
    }
    else
    {
        printf("flushing file to disk\n");
        fwrite(buffer, bytesUsedInBuffer, 1, fp);
        memset(buffer, 0, sizeOfBuffer);
        bytesUsedInBuffer = 0;
        strcat(buffer, rowInLog);
        bytesUsedInBuffer += bytesInRow;
    }
}

But this is making the execution a lot slow and it is not because of flushing because the message "flushing file to disk" is not printed on screen. Without any call to this logging function the whole program executes in minutes but along with this, it did not complete even in 2 hours. Is there some other fundamental flaw?

Bit Manipulator
  • 248
  • 1
  • 2
  • 17
  • Think about how strings work. All your `strlen` and `strcat` calls do a *lot* of repeated work to compute something you already know. Pro tip: read the manual for `sprintf`. – Kerrek SB Dec 22 '13 at 19:53
  • 2
    Also, no need for `memset` - `buffer[0] = '\0'` is enough. – Nemanja Boric Dec 22 '13 at 19:54
  • How big is your `buffer` ? – Noam Rathaus Dec 22 '13 at 19:54
  • it is quite big - 1 GB – Bit Manipulator Dec 22 '13 at 20:05
  • @BitManipulator : Wow. Yep, I expect your code is very, _very_ slow with a 1GB buffer. See my O(N^2) comment below. – Joe Z Dec 22 '13 at 20:08
  • @BitManipulator: If you make the changes I suggest in my answer below, I would also recommend reducing the size of your buffer to something more like 1MB. That way the OS has a better chance of buffering the write and handling it in the background in parallel with your task. There will be a sweet spot for the "right" point to hand a buffer over from your application to the OS, and I think 1GB is well past that point. – Joe Z Dec 22 '13 at 20:11

2 Answers2

3

Your answer I suspect is right here:

if(bytesInRow + bytesUsedInBuffer <= sizeOfBuffer)
{
    strcat(buffer, rowInLog);  // <--- riiiight here.
    bytesUsedInBuffer += bytesInRow;
}

The strcat() function will scan the entire buffer to find the end ever time you call it. If buffer is large and mostly full, that could be quite slow. The behavior is roughly O(N2) in the size of buffer. Your performance will degrade rapidly as you increase the size of your buffer. That's pretty much the opposite of what you want from your buffer. (Edit: You mentioned in a comment that your buffer is 1GB. I would expect the above code to be very, very slow as that buffer fills.)

However, you already know exactly where the end is, and how many bytes to copy. So do this instead:

if(bytesInRow + bytesUsedInBuffer <= sizeOfBuffer)
{
    memcpy(buffer + bytesUsedInBuffer, rowInLog, bytesInRow + 1);
    bytesUsedInBuffer += bytesInRow;
}

Note that I had memcpy copy one extra byte, so that it puts the NUL terminator on the buffer, just in case you have any other strXXX functions laying around that operate on buffer. If you don't, you can safely remove the + 1 above.

A similar, less egregious error occurs in your else clause. You can replace that also with a memcpy:

    printf("flushing file to disk\n");
    fwrite(buffer, bytesUsedInBuffer, 1, fp);
    memcpy(buffer, rowInLog, bytesInRow + 1);
    bytesUsedInBuffer = bytesInRow;

You can also save a little bit of time by combining these statements:

sprintf(rowInLog,"%d,%d,%d,%d,%d\n",a,b,c,d,e); 
int bytesInRow = strlen(rowInLog);

The sprintf returns the length of the output string, so you could simply say:

int bytesInRow = sprintf(rowInLog,"%d,%d,%d,%d,%d\n",a,b,c,d,e); 

That wasn't the main performance issue in your code, but changing that will improve it further.


EDIT: An even better, alternate approach:

If you want to eliminate the memcpy() entirely, consider this alternate approach:

bytesUsedInBuffer += snprintf( buffer + bytesUsedInBuffer, maximumLineSize, 
                               "%d,%d,%d,%d,%d\n", a,b,c,d,e );

if (bytesUsedInBuffer >= sizeOfBuffer - maximumLineSize )
{
    fwrite(buffer, bytesUsedInBuffer, 1, fp);
    bytesUsedInBuffer = 0;
}

Set maximumLineSize to a reasonable value for your row of 5 integers, such as 60. (10 bytes for each integer including sign plus 5 bytes for commas and newline is 55, so 60 is a nice round number above that.)

Joe Z
  • 17,413
  • 3
  • 28
  • 39
1

You are computing the length of the entire string every time around! This means the entire and growing string needs to be shuffled through the processor. Doing so is roughly speaking a worst case scenario! You are much better off writing the string to a file once in a while. Additionally, you should keep track of the last write position and append the string right there:

size_t size = sprintf(rowInLog + rowInLogSize, "%d,%d,%d,%d,%d\n", a, b, c, d, e);
rowInLogSize += size;
Dietmar Kühl
  • 150,225
  • 13
  • 225
  • 380