
I'm working with C++, ifstream, and text files. I am looking for the position of the end of each line, because I need to read n characters from the end of each line.

Currently, I am reading every byte and testing whether it is the Unix newline character (LF).

Unfortunately, the input is usually a long text, and my method isn't fast.

Is there any faster way?

Nanik

6 Answers


If you are looking for raw speed, I'd memory-map the file and use something like strchr to find the newline:

p = strchr(line_start, '\n');

Then, so long as p isn't NULL and doesn't point at the first character of the memory region, you can just use p[-1] to read the character before the newline.

NOTE: if the file could possibly contain '\0' characters, you should use memchr instead. In fact, that may be desirable regardless, since it lets you specify the size of the buffer (the memory region).
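
A minimal POSIX sketch of that idea, assuming mmap is available; the file name is a placeholder and error handling is kept short:

#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("input.txt", O_RDONLY); // "input.txt" is a placeholder path
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }
    const char* data = static_cast<const char*>(mem);
    const char* end = data + st.st_size;

    // memchr is bounded by the region size, so embedded '\0' bytes are harmless
    const char* p = data;
    while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p))) != nullptr) {
        if (p != data)
            std::printf("char before newline: %c\n", p[-1]);
        ++p; // resume scanning just past this newline
    }

    munmap(mem, st.st_size);
    close(fd);
    return 0;
}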

Evan Teran

I'm working with C++, ifstream, and text files. I am looking for the position of the end of each line, because I need to read n characters from the end of each line.

I'll focus on your requirement, reading 'n' characters from the end of the line, rather than your question:

// Untested.
#include <iostream>
#include <string>

int main() {
    const std::string::size_type n = 5; // n = 5 is an assumed value
    std::string s;
    while (std::getline(std::cin, s)) {
        if (s.size() > n) s.erase(s.begin(), s.end() - n);
        // s is now the last 'n' chars of the line
        std::cout << "Last N chars: " << s << "\n";
    }
}
Robᵩ

You could take a look at the std::getline function from <string>. Try reading an entire line at a time and then reading the characters you need from the end of the string, as in the sketch below.
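
A minimal sketch of that approach; the file name and the value of n are assumptions:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream in("input.txt");          // placeholder file name
    const std::string::size_type n = 5;     // assumed: how many trailing chars to keep
    std::string line;
    while (std::getline(in, line)) {
        // substr with a clamped start index yields at most the last n characters
        std::string tail = line.size() > n ? line.substr(line.size() - n) : line;
        std::cout << tail << '\n';
    }
    return 0;
}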

As usual with performance issues, the real trick is to run your code through a profiler to see where it's spending its time. There's often a very real difference between "Fastest" and "Fast Enough."

Michael Kristofik

There is no easier way to find the end-of-line marker, but you could save some time by storing what you read as you read your data. Then you would not need to go back, and your loop will be very fast.

Create a character array of size n, and use it as a circular buffer: when you get to the end of the array, just circle back to its beginning. Store the character in the next position of your circular buffer.

When you detect '\n', your buffer contains the n prior characters, only slightly out of order: the prefix starts at your buffer pointer and goes to the end of the buffer, and the suffix starts at zero and goes to your buffer pointer minus one.

Here is an example of how you can make it work (assuming n == 20):

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    ifstream fs("c:\\temp\\a.txt");
    const int n = 20;      // size of the circular buffer
    char buf[n];
    int bp = 0;            // next write position in the buffer
    bool circular = false; // set once the buffer has wrapped around
    char ch;
    while (fs.get(ch)) {   // get(ch) fails cleanly at end of file
        if (ch != '\n') {
            buf[bp] = ch;
            bp = (bp+1) % n;
            circular |= !bp;
        } else {
            string s;
            if (circular) {
                // older part first (bp..n), then the newer part (0..bp)
                s = string(buf+bp, buf+n) + string(buf, buf+bp);
            } else {
                s = string(buf, buf+bp);
            }
            cerr << s << endl;
            circular = false;
            bp = 0;
        }
    }
    return 0;
}
Sergey Kalinichenko

The quick & dirty way is something along these lines:

ifs.seekg( 0, std::ifstream::end );      // seek to the end to learn the file size
std::string buffer( ifs.tellg(), '\0' ); // allocate a buffer of exactly that size
ifs.seekg( 0, std::ifstream::beg );      // rewind to the beginning
ifs.read( &buffer[0], buffer.size() );   // slurp the whole file in one read

Then work on buffer instead. This will probably net you all the speedup you need (many, many orders of magnitude in my experience). If you want to be able to handle arbitrarily large files, you need to modify the logic a bit (search in chunks instead), as sketched below.
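
A hedged sketch of that chunked variant; the file name and chunk size are assumptions. The tail of each chunk is carried over so a line that straddles a chunk boundary isn't lost:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream ifs("input.txt", std::ifstream::binary); // placeholder name
    std::vector<char> chunk(1 << 20); // assumed chunk size: 1 MiB
    std::string carry;                // unfinished line carried between chunks
    while (ifs) {
        ifs.read(chunk.data(), chunk.size());
        std::streamsize got = ifs.gcount();
        if (got <= 0) break;
        std::string work = carry + std::string(chunk.data(),
                                               static_cast<std::string::size_type>(got));
        std::string::size_type pos = 0, nl;
        while ((nl = work.find('\n', pos)) != std::string::npos) {
            // work[pos, nl) is one complete line; its last characters end at nl
            pos = nl + 1;
        }
        carry = work.substr(pos); // keep the partial trailing line for the next round
    }
    return 0;
}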

Ylisar

Whatever you do, you'll still end up searching linearly through the file. You might search faster, but it will still be a linear search.

The real solution is to change the format of the file, so indexes of "interesting" characters are written near the beginning of the file. When the time comes to read it, you can completely skip the "uninteresting" parts of the file.

If that's not possible, you might be able to generate a separate "index" file. This won't save you from having to perform the linear search once, but it will save you from having to do it repeatedly on the same file. This of course matters only if you're going to process the same file more than once; a sketch of building such an index follows.
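
A minimal sketch of generating that index (the file names are placeholders): record the byte offset of every newline during one scan, then write the offsets out so later runs can seek directly to line ends.

#include <cstdint>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("input.txt", std::ifstream::binary);  // placeholder input
    std::ofstream out("input.idx", std::ofstream::binary); // placeholder index file
    std::vector<std::uint64_t> offsets;
    std::uint64_t pos = 0;
    char ch;
    while (in.get(ch)) {
        if (ch == '\n') offsets.push_back(pos); // byte offset of each line end
        ++pos;
    }
    // A later run can read these offsets back and seekg() straight to each line end.
    out.write(reinterpret_cast<const char*>(offsets.data()),
              static_cast<std::streamsize>(offsets.size() * sizeof(std::uint64_t)));
    return 0;
}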

BTW, even the linear scan should be pretty fast; you should be I/O-bound more than anything. How large are your files, and what do you mean by "my method isn't fast"?

Branko Dimitrijevic