3

I am parsing a very large file of records (one per line, each of varying length), and I'd like to keep track of the number of bytes I've read in the file so that I may recover in the event of a failure.

I wrote the following:

using (TextReader myTextReader = CreateTextReader())
{
    string record;
    while ((record = myTextReader.ReadLine()) != null)
    {
        // Counts characters only; the line terminator is not included.
        bytesRead += record.Length;
        ParseRecord(record);
    }
}

However, this doesn't work because ReadLine() strips the CR/LF characters that terminate the line. Furthermore, a line may be terminated by CR, LF, or CRLF, which means I can't just add 1 to bytesRead.

Is there an easy way to get the actual line length, or do I write my own ReadLine() method in terms of the granular Read() operations?

Steve Guidi

4 Answers

2

Getting the current position of the underlying stream won't help, since the StreamReader will buffer data read from the stream.

Essentially you can't do this without writing your own StreamReader. But do you really need to?

I would simply count the number of lines read.

Of course, this means that to position to a specific line you will need to read N lines rather than simply seeking to an offset, but what's wrong with that? Have you determined that performance will be unacceptable?
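
As a rough illustration of the line-counting idea, here is a minimal sketch. `LineCheckpoint`, `Resume`, and the `parseRecord` callback are hypothetical names, not anything from the question; it assumes the count of processed lines was persisted somewhere before the failure.

```csharp
using System;
using System.IO;

static class LineCheckpoint
{
    // Hypothetical helper: re-reads and discards the lines that were already
    // processed, then hands every remaining line to the caller.
    // Returns the total number of lines consumed, i.e. the new checkpoint.
    public static long Resume(TextReader reader, long linesAlreadyProcessed,
                              Action<string> parseRecord)
    {
        long lineNumber = 0;
        string record;
        while ((record = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (lineNumber <= linesAlreadyProcessed)
            {
                continue; // handled before the failure, skip without parsing
            }
            parseRecord(record);
        }
        return lineNumber;
    }
}
```

The skipped lines are still read from disk, but they are not parsed, which is often the expensive part.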

Joe
1

A TextReader reads strings, which are sequences of characters. Depending on the encoding, the number of characters is not necessarily equal to the number of bytes.
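
To make the character/byte difference concrete, here is a small sketch. `ByteLength.Of` is a hypothetical wrapper; the only framework call it relies on is `Encoding.GetByteCount`.

```csharp
using System;
using System.Text;

static class ByteLength
{
    // Number of bytes the text occupies when written in the given encoding.
    public static int Of(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }
}
```

For example, the string "na\u00EFve" has a Length of 5, but `ByteLength.Of` reports 6 bytes for Encoding.UTF8 and 10 bytes for Encoding.Unicode (UTF-16). For pure ASCII text, the two counts agree.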

How about just storing the number of lines read, and skipping that many lines when recovering? I guess the point is to avoid processing those lines again, not necessarily to avoid reading them from the stream.

sisve
  • In my case, I can assume that the file I am reading contains single-byte ASCII characters. Also, while I can store the line number, I was hoping to seek forward in the stream, avoiding having to read each line that I already parsed (lines are not fixed length). – Steve Guidi Jun 03 '10 at 06:41
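
Under the single-byte ASCII assumption stated in the comment above, one option is to bypass TextReader and read bytes from the stream directly, so the exact offset is always known. The following is only a sketch: `OffsetLineReader` is a hypothetical class, not a framework type, and it would give wrong results for multi-byte encodings.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical reader: returns lines from a Stream and tracks the byte
// offset just past the last line returned. Assumes one byte per character.
sealed class OffsetLineReader
{
    private const int NoByte = -2;      // marker: nothing held back
    private readonly Stream _stream;
    private int _peeked = NoByte;       // byte read ahead after a lone CR

    public OffsetLineReader(Stream stream) { _stream = stream; }

    // Bytes consumed so far, relative to where this reader started.
    public long BytesRead { get; private set; }

    private int NextByte()
    {
        int b = _peeked != NoByte ? _peeked : _stream.ReadByte();
        _peeked = NoByte;
        if (b >= 0) BytesRead++;
        return b;
    }

    // Returns the next line without its terminator, or null at end of stream.
    public string ReadLine()
    {
        int b = NextByte();
        if (b < 0) return null;

        var line = new StringBuilder();
        while (b >= 0 && b != '\r' && b != '\n')
        {
            line.Append((char)b);
            b = NextByte();
        }
        if (b == '\r')
        {
            // A CR may be followed by an LF belonging to the same terminator.
            int next = _stream.ReadByte();
            if (next == '\n') BytesRead++;
            else _peeked = next;        // not ours: keep it for the next line
        }
        return line.ToString();
    }
}
```

To recover, one would seek the stream to the saved offset, construct a new reader, and add the saved offset to its BytesRead. Calling ReadByte() once per byte leans on the stream's own buffering (FileStream buffers internally), so whether this is fast enough is worth measuring.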
1

If the text is already in memory as a string, you can use a multiline regular expression to find each line; match.Index gives the character offset at which that line starts.

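// In Multiline mode, '^' and '$' match at the start and end of every line.
// '.' does not match '\n', but it does match '\r', so with CRLF line
// endings the captured value still ends with '\r'.
var regex = new Regex("^(.*)$", RegexOptions.Compiled | RegexOptions.Multiline);
var matches = regex.Matches(text);
var count = matches.Count;
for (var matchIndex = 0; matchIndex < count; ++matchIndex)
{
    var match = matches[matchIndex];
    var group = match.Groups[1];
    var value = group.Captures[0].Value;
    // match.Index is a character offset into text, not a byte offset.
    Console.WriteLine($"Line {matchIndex + 1} (pos={match.Index}): {value}");
}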
0

Come to think of it, I can use a StreamReader and get the current position of the underlying stream as follows.

using (StreamReader myTextReader = CreateStreamReader())
{
    string record = myTextReader.ReadLine();
    // Position is an absolute offset, so assign it rather than accumulate it.
    bytesRead = myTextReader.BaseStream.Position;
    ParseRecord(record);
    // ...
}
Steve Guidi
  • 2
    That only works if the underlying stream supports seeking, which probably works in your case, but I should point out that this method will not work for every case. – Dave Van den Eynde Jun 03 '10 at 06:14
  • 2
    There may also be a problem because the StreamReader buffers its input, so the BaseStream position would advance in chunks. – sisve Jun 03 '10 at 06:21
  • This definitely does not work: for performance reasons, TextReader reads the base stream in blocks of 4096 bytes instead of byte-by-byte. This is actually the same as what Simon said. – Jecho Jekov Aug 01 '13 at 20:26
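
The buffering effect described in these comments can be observed with a small sketch. `BufferingDemo` is a hypothetical helper, and the exact position reported depends on the reader's buffer size, which varies with the constructor used and the framework version.

```csharp
using System;
using System.IO;
using System.Text;

static class BufferingDemo
{
    // Reads a single line through a StreamReader and reports how far the
    // underlying stream has actually advanced.
    public static long PositionAfterFirstLine(byte[] data)
    {
        using (var reader = new StreamReader(new MemoryStream(data), Encoding.ASCII))
        {
            reader.ReadLine();
            return reader.BaseStream.Position;
        }
    }
}
```

Given a few thousand bytes of short lines, the returned position lands well past the end of the first line, because the reader has already pulled a whole block into its internal buffer.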