
I'm working with C++, ifstream, and text files. I am looking for the position of the end of each line, because I need to read n characters from the end of each line.

Currently, I am reading every byte and testing whether it is the Unix newline character (LF).

Unfortunately, the input is usually a long text, and my method isn't fast.

Is there any faster way?

Nanik

6 Answers


If you are looking for raw speed, I'd memory-map the file and use something like strchr to find the newline:

p = strchr(line_start, '\n');

Then, so long as p isn't NULL and doesn't point at the first character of the memory region, you can just use p[-1] to read the character before the newline.

NOTE: if the file could possibly contain '\0' characters, you should use memchr instead. In fact, that may be desirable regardless, since it lets you specify the size of the buffer (the memory region).
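
A minimal POSIX sketch of that idea, assuming mmap is available; the file name is a placeholder and error handling is kept short:

#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("input.txt", O_RDONLY); // "input.txt" is a placeholder path
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }
    const char* data = static_cast<const char*>(mem);
    const char* end = data + st.st_size;

    // memchr is bounded by the region size, so embedded '\0' bytes are harmless
    const char* p = data;
    while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p))) != nullptr) {
        if (p != data)
            std::printf("char before newline: %c\n", p[-1]);
        ++p; // resume scanning just past this newline
    }

    munmap(mem, st.st_size);
    close(fd);
    return 0;
}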

Evan Teran

I'm working with C++, ifstream, and text files. I am looking for the position of the end of each line, because I need to read n characters from the end of each line.

I'll focus on your requirement, reading 'n' characters from the end of the line, rather than your question:

// Untested.
#include <iostream>
#include <string>

int main() {
    const std::string::size_type n = 5; // n = 5 is an assumed value
    std::string s;
    while (std::getline(std::cin, s)) {
        if (s.size() > n) s.erase(s.begin(), s.end() - n);
        // s is now the last 'n' chars of the line
        std::cout << "Last N chars: " << s << "\n";
    }
}
Robᵩ

You could take a look at the std::getline function from <string>. Try reading an entire line at a time and then reading the characters you need from the end of the string, as in the sketch below.
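
A minimal sketch of that approach; the file name and the value of n are assumptions:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream in("input.txt");          // placeholder file name
    const std::string::size_type n = 5;     // assumed: how many trailing chars to keep
    std::string line;
    while (std::getline(in, line)) {
        // substr with a clamped start index yields at most the last n characters
        std::string tail = line.size() > n ? line.substr(line.size() - n) : line;
        std::cout << tail << '\n';
    }
    return 0;
}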

As usual with performance issues, the real trick is to run your code through a profiler to see where it's spending its time. There's often a very real difference between "Fastest" and "Fast Enough."

Michael Kristofik

There is no easier way to find the end-of-line marker, but you could save some time by storing what you read as you read your data. Then you would not need to go back, and your loop will be very fast.

Create a character array of size n, and use it as a circular buffer: when you get to the end of the array, just circle back to its beginning. Store the character in the next position of your circular buffer.

When you detect '\n', your buffer contains the n prior characters, only slightly out of order: the prefix starts at your buffer pointer and goes to the end of the buffer, and the suffix starts at zero and goes to your buffer pointer minus one.

Here is an example of how you can make it work (assuming n == 20):

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    ifstream fs("c:\\temp\\a.txt");
    const int n = 20;      // size of the circular buffer
    char buf[n];
    int bp = 0;            // next write position in the buffer
    bool circular = false; // set once the buffer has wrapped around
    char ch;
    while (fs.get(ch)) {   // get(ch) fails cleanly at end of file
        if (ch != '\n') {
            buf[bp] = ch;
            bp = (bp+1) % n;
            circular |= !bp;
        } else {
            string s;
            if (circular) {
                // older part first (bp..n), then the newer part (0..bp)
                s = string(buf+bp, buf+n) + string(buf, buf+bp);
            } else {
                s = string(buf, buf+bp);
            }
            cerr << s << endl;
            circular = false;
            bp = 0;
        }
    }
    return 0;
}
Sergey Kalinichenko

The quick & dirty way is something along these lines:

ifs.seekg( 0, std::ifstream::end );      // seek to the end to learn the file size
std::string buffer( ifs.tellg(), '\0' ); // allocate a buffer of exactly that size
ifs.seekg( 0, std::ifstream::beg );      // rewind to the beginning
ifs.read( &buffer[0], buffer.size() );   // slurp the whole file in one read

Then work on buffer instead. This will probably net you all the speedup you need (many, many orders of magnitude in my experience). If you want to be able to handle arbitrarily large files, you need to modify the logic a bit (search in chunks instead), as sketched below.
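
A hedged sketch of that chunked variant; the file name and chunk size are assumptions. The tail of each chunk is carried over so a line that straddles a chunk boundary isn't lost:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream ifs("input.txt", std::ifstream::binary); // placeholder name
    std::vector<char> chunk(1 << 20); // assumed chunk size: 1 MiB
    std::string carry;                // unfinished line carried between chunks
    while (ifs) {
        ifs.read(chunk.data(), chunk.size());
        std::streamsize got = ifs.gcount();
        if (got <= 0) break;
        std::string work = carry + std::string(chunk.data(),
                                               static_cast<std::string::size_type>(got));
        std::string::size_type pos = 0, nl;
        while ((nl = work.find('\n', pos)) != std::string::npos) {
            // work[pos, nl) is one complete line; its last characters end at nl
            pos = nl + 1;
        }
        carry = work.substr(pos); // keep the partial trailing line for the next round
    }
    return 0;
}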

Ylisar

Whatever you do, you'll still end up searching linearly through the file. You might search faster, but it will still be a linear search.

The real solution is to change the format of the file, so indexes of "interesting" characters are written near the beginning of the file. When the time comes to read it, you can completely skip the "uninteresting" parts of the file.

If that's not possible, you might be able to generate a separate "index" file. This won't save you from having to perform the linear search once, but it will save you from having to do it repeatedly on the same file. This of course matters only if you're going to process the same file more than once; a sketch of building such an index follows.
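
A minimal sketch of generating that index (the file names are placeholders): record the byte offset of every newline during one scan, then write the offsets out so later runs can seek directly to line ends.

#include <cstdint>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("input.txt", std::ifstream::binary);  // placeholder input
    std::ofstream out("input.idx", std::ofstream::binary); // placeholder index file
    std::vector<std::uint64_t> offsets;
    std::uint64_t pos = 0;
    char ch;
    while (in.get(ch)) {
        if (ch == '\n') offsets.push_back(pos); // byte offset of each line end
        ++pos;
    }
    // A later run can read these offsets back and seekg() straight to each line end.
    out.write(reinterpret_cast<const char*>(offsets.data()),
              static_cast<std::streamsize>(offsets.size() * sizeof(std::uint64_t)));
    return 0;
}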

BTW, even the linear scan should be pretty fast; you should be I/O-bound more than anything. How large are your files, and what do you mean by "my method isn't fast"?

Branko Dimitrijevic