Reading text files line by line, with exact offset/position reporting

Question

My simple requirement: read a huge text file (more than a million lines; for this example, assume it's some sort of CSV) and keep a reference to the beginning of each line for faster lookup in the future (read a line, starting at X).

I tried the naive and easy way first, using a StreamReader and accessing the underlying BaseStream.Position. Unfortunately that doesn't work as I intended:

Given a file containing the following

Foo
Bar
Baz
Bla
Fasel

and this very simple code

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = sr.BaseStream.Position;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos = sr.BaseStream.Position;
  }
}

the output is:

000 Foo
025 Bar
025 Baz
025 Bla
025 Fasel

I can imagine that the stream is trying to be helpful/efficient and probably reads in (big) chunks whenever new data is necessary. For me this is bad..
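The buffering effect can be reproduced without a file. This is a minimal sketch, assuming the sample is ASCII with \r\n line endings and no trailing newline (17 characters plus 4 line breaks of 2 bytes each, which matches the 25 in the output). BufferDemo and PositionAfterFirstLine are hypothetical names, and a MemoryStream stands in for the file.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferDemo
{
    // Returns BaseStream.Position after reading only the first line.
    public static long PositionAfterFirstLine(byte[] fileBytes)
    {
        using (var stream = new MemoryStream(fileBytes))
        using (var reader = new StreamReader(stream, Encoding.ASCII))
        {
            reader.ReadLine(); // logically consumes only the first line...
            // ...but the reader has already pulled a whole buffer from the stream,
            // so the stream position reflects the buffer fill, not the line end.
            return stream.Position;
        }
    }
}
```

For input this small the whole content fits into the reader's internal buffer, so the position jumps straight to the end after the first read.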

The question, finally: Any way to get the (byte, char) offset while reading a file line by line without using a basic Stream and messing with \r \n \r\n and string encoding etc. manually? Not a big deal, really, I just don't like to build things that might exist already..

If you reflect out the System.IO.StreamReader class, the minimum buffer allowed is 128 bytes... not sure if this will help, but on a longer file when I tried this, that was the shortest position I could get. — Nathan Wheeler, Apr 07 '10 at 17:16

score 13 · Accepted Answer · answered Apr 07 '10 at 16:58

You could create a TextReader wrapper, which would track the current position in the base TextReader :

public class TrackingTextReader : TextReader
{
    private TextReader _baseReader;
    private int _position;

    public TrackingTextReader(TextReader baseReader)
    {
        _baseReader = baseReader;
    }

    public override int Read()
    {
        int c = _baseReader.Read();
        if (c >= 0)
            _position++; // count only characters actually read, not the end-of-stream marker (-1)
        return c;
    }

    public override int Peek()
    {
        return _baseReader.Peek();
    }

    public int Position
    {
        get { return _position; }
    }
}

You could then use it as follows :

string text = @"Foo
Bar
Baz
Bla
Fasel";

using (var reader = new StringReader(text))
using (var trackingReader = new TrackingTextReader(reader))
{
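    string line;
    int lineStart = trackingReader.Position; // offset where the next line begins
    while ((line = trackingReader.ReadLine()) != null)
    {
        Console.WriteLine("{0:d3} {1}", lineStart, line);
        lineStart = trackingReader.Position;
    }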
}

Thomas Levesque

Seems to work. That somehow seems so obvious now.. Thanks a lot. – Benjamin Podszun Apr 07 '10 at 17:06

This solution is fine as long as you want the character position, rather than the byte position. If the underlying file has a Byte Order Mark (BOM) it will offset, or if it uses multi-byte characters, the 1:1 correspondence between characters and bytes no longer holds. – Frederik Apr 27 '10 at 11:36
Agreed, only works for single byte encoded characters e.g. ASCII. If for instance your underlying file is Unicode, each character will be 2 or 4 byte encoded. The implementation above is working on a character stream, not a byte stream, so you will get character offsets which will not map onto the actual byte positions as each character can be 2 or 4 bytes. For example, the second character position will be reported as index 1, but the byte position will actually be index 2 or 4. If there is a BOM (Byte Order Mark) this will again add extra bytes to the true underlying byte position. – Tim Lloyd Apr 27 '10 at 11:52
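If byte offsets are what is needed, one option is to skip the text layer while indexing and scan the raw bytes for line terminators. The following is only a sketch under stated assumptions: it assumes an ASCII-compatible encoding such as UTF-8, where the bytes 0x0A and 0x0D never occur inside a multi-byte character, so it would not work for UTF-16 or UTF-32. LineIndexer and BuildIndex are hypothetical names. A BOM, if present, is simply counted as part of the first line, whose offset stays 0.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineIndexer
{
    // Returns the byte offset at which each line starts.
    // Accepts "\n", "\r" and "\r\n" as terminators, mixed freely.
    public static List<long> BuildIndex(Stream stream)
    {
        var offsets = new List<long>();
        var buffer = new byte[64 * 1024];
        long position = 0;        // absolute offset of the byte being examined
        bool atLineStart = true;  // the next byte (if any) begins a new line
        bool lastWasCr = false;   // the previous byte was '\r'
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++, position++)
            {
                byte b = buffer[i];
                if (lastWasCr && b == (byte)'\n')
                {
                    // Second half of "\r\n": still part of the previous terminator.
                    lastWasCr = false;
                    continue;
                }
                if (atLineStart)
                {
                    offsets.Add(position);
                    atLineStart = false;
                }
                lastWasCr = b == (byte)'\r';
                if (b == (byte)'\n' || b == (byte)'\r')
                    atLineStart = true;
            }
        }
        return offsets;
    }
}
```

Because the terminator state is carried across buffer refills, a "\r\n" pair split between two reads is still treated as one line break. A trailing newline at the end of the file does not produce an extra empty entry.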

score 5 · Answer 2 · edited Oct 23 '12 at 16:46

After searching, testing, and doing something crazy, here is my code to solve this (I'm currently using it in my product).

public sealed class TextFileReader : IDisposable
{

    FileStream _fileStream = null;
    BinaryReader _binReader = null;
    StreamReader _streamReader = null;
    List<string> _lines = null;
    long _length = -1;

    /// <summary>
    /// Initializes a new instance of the <see cref="TextFileReader"/> class with default encoding (UTF8).
    /// </summary>
    /// <param name="filePath">The path to text file.</param>
    public TextFileReader(string filePath) : this(filePath, Encoding.UTF8) { }

    /// <summary>
    /// Initializes a new instance of the <see cref="TextFileReader"/> class.
    /// </summary>
    /// <param name="filePath">The path to text file.</param>
    /// <param name="encoding">The encoding of text file.</param>
    public TextFileReader(string filePath, Encoding encoding)
    {
        if (!File.Exists(filePath))
            throw new FileNotFoundException("File (" + filePath + ") is not found.");

        _fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read);
        _length = _fileStream.Length;
        _binReader = new BinaryReader(_fileStream, encoding);
    }

    /// <summary>
    /// Reads a line of characters from the current stream at the current position and returns the data as a string.
    /// </summary>
    /// <returns>The next line from the input stream, or null if the end of the input stream is reached</returns>
    public string ReadLine()
    {
        if (_binReader.PeekChar() == -1)
            return null;

        var line = new StringBuilder(); // avoids repeated string concatenation on long lines
        int nextChar = _binReader.Read();
        while (nextChar != -1)
        {
            char current = (char)nextChar;
            if (current == '\n')
                break;
            if (current == '\r')
            {
                // Treat "\r\n" as a single line terminator.
                if (_binReader.PeekChar() == '\n')
                    _binReader.Read();
                break;
            }
            line.Append(current);
            nextChar = _binReader.Read();
        }
        return line.ToString();
    }

    /// <summary>
    /// Reads some lines of characters from the current stream at the current position and returns the data as a collection of string.
    /// </summary>
    /// <param name="totalLines">The total number of lines to read (set as 0 to read from current position to end of file).</param>
    /// <returns>The next lines from the input stream, or an empty collection if the end of the input stream is reached</returns>
    public List<string> ReadLines(int totalLines)
    {
        if (totalLines < 1 && this.Position == 0)
            return this.ReadAllLines();

        _lines = new List<string>();
        int counter = 0;
        string line = this.ReadLine();
        while (line != null)
        {
            _lines.Add(line);
            counter++;
            if (totalLines > 0 && counter >= totalLines)
                break;
            line = this.ReadLine();
        }
        return _lines;
    }

    /// <summary>
    /// Reads all lines of characters from the current stream (from beginning to end) and returns the data as a collection of string.
    /// </summary>
    /// <returns>All lines from the input stream, or an empty collection if the stream is empty</returns>
    public List<string> ReadAllLines()
    {
        if (_streamReader == null)
            _streamReader = new StreamReader(_fileStream);
        _streamReader.BaseStream.Seek(0, SeekOrigin.Begin);
        _lines = new List<string>();
        string line = _streamReader.ReadLine();
        while (line != null)
        {
            _lines.Add(line);
            line = _streamReader.ReadLine();
        }
        return _lines;
    }

    /// <summary>
    /// Gets the length of text file (in bytes).
    /// </summary>
    public long Length
    {
        get { return _length; }
    }

    /// <summary>
    /// Gets or sets the current reading position.
    /// </summary>
    public long Position
    {
        get
        {
            if (_binReader == null)
                return -1;
            else
                return _binReader.BaseStream.Position;
        }
        set
        {
            if (_binReader == null)
                return;
            else if (value >= this.Length)
                this.SetPosition(this.Length);
            else
                this.SetPosition(value);
        }
    }

    void SetPosition(long position)
    {
        _binReader.BaseStream.Seek(position, SeekOrigin.Begin);
    }

    /// <summary>
    /// Gets the lines after reading.
    /// </summary>
    public List<string> Lines
    {
        get
        {
            return _lines;
        }
    }

    /// <summary>
    /// Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.
    /// </summary>
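    public void Dispose()
    {
        // Closing a reader also closes the FileStream it wraps; the extra calls are harmless.
        if (_binReader != null)
            _binReader.Close();
        if (_streamReader != null)
            _streamReader.Dispose();
        if (_fileStream != null)
            _fileStream.Dispose();
    }

    // No finalizer: this class holds only managed objects, and a finalizer
    // must not touch other managed objects because they may already be finalized.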
}

Anton · Answer 3 · 2014-03-24T13:04:56.317

This is a really tough issue. After a long and exhausting search through different solutions on the internet (including the solutions from this thread, thank you!), I had to reinvent the wheel and write my own.

I had the following requirements:

Performance - reading must be very fast, so reading one char at a time or using reflection is not acceptable; buffering is required
Streaming - the file can be huge, so reading it into memory entirely is not acceptable
Tailing - file tailing should be available
Long lines - lines can be very long, so the buffer size can't be fixed
Stable - a single-byte error was immediately visible during usage. Unfortunately for me, several implementations I found had stability problems

public class OffsetStreamReader
{
    private const int InitialBufferSize = 4096;    
    private readonly char _bom;
    private readonly byte _end;
    private readonly Encoding _encoding;
    private readonly Stream _stream;
    private readonly bool _tail;

    private byte[] _buffer;
    private int _processedInBuffer;
    private int _informationInBuffer;

    public OffsetStreamReader(Stream stream, bool tail)
    {
        _buffer = new byte[InitialBufferSize];
        _processedInBuffer = InitialBufferSize;

        if (stream == null || !stream.CanRead)
            throw new ArgumentException("stream");

        _stream = stream;
        _tail = tail;
        _encoding = Encoding.UTF8;

        _bom = '\uFEFF';
        _end = _encoding.GetBytes(new [] {'\n'})[0];
    }

    public long Offset { get; private set; }

    public string ReadLine()
    {
        // Underlying stream closed
        if (!_stream.CanRead)
            return null;

        // EOF
        if (_processedInBuffer == _informationInBuffer)
        {
            if (_tail)
            {
                _processedInBuffer = _buffer.Length;
                _informationInBuffer = 0;
                ReadBuffer();
            }

            return null;
        }

        var lineEnd = Search(_buffer, _end, _processedInBuffer);
        var haveEnd = true;

        // File ended but no finalizing newline character
        if (lineEnd.HasValue == false && _informationInBuffer + _processedInBuffer < _buffer.Length)
        {
            if (_tail)
                return null;
            else
            {
                lineEnd = _informationInBuffer;
                haveEnd = false;
            }
        }

        // No end in current buffer
        if (!lineEnd.HasValue)
        {
            ReadBuffer();
            if (_informationInBuffer != 0)
                return ReadLine();

            return null;
        }

        var arr = new byte[lineEnd.Value - _processedInBuffer];
        Array.Copy(_buffer, _processedInBuffer, arr, 0, arr.Length);

        Offset = Offset + lineEnd.Value - _processedInBuffer + (haveEnd ? 1 : 0);
        _processedInBuffer = lineEnd.Value + (haveEnd ? 1 : 0);

        return _encoding.GetString(arr).TrimStart(_bom).TrimEnd('\r', '\n');
    }

    private void ReadBuffer()
    {
        var notProcessedPartLength = _buffer.Length - _processedInBuffer;

        // Extend buffer to be able to fit whole line to the buffer
        // Was     [NOT_PROCESSED]
        // Become  [NOT_PROCESSED        ]
        if (notProcessedPartLength == _buffer.Length)
        {
            var extendedBuffer = new byte[_buffer.Length + _buffer.Length/2];
            Array.Copy(_buffer, extendedBuffer, _buffer.Length);
            _buffer = extendedBuffer;
        }

        // Copy not processed information to the begining
        // Was    [PROCESSED NOT_PROCESSED]
        // Become [NOT_PROCESSED          ]
        Array.Copy(_buffer, (long) _processedInBuffer, _buffer, 0, notProcessedPartLength);

        // Read more information to the empty part of buffer
        // Was    [ NOT_PROCESSED                   ]
        // Become [ NOT_PROCESSED NEW_NOT_PROCESSED ]
        _informationInBuffer = notProcessedPartLength + _stream.Read(_buffer, notProcessedPartLength, _buffer.Length - notProcessedPartLength);

        _processedInBuffer = 0;
    }

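    private int? Search(byte[] buffer, byte byteToSearch, int bufferOffset)
    {
        // Scan only the bytes that currently hold data, including the last one;
        // anything past _informationInBuffer is stale content from an earlier read.
        for (int i = bufferOffset; i < _informationInBuffer; i++)
        {
            if (buffer[i] == byteToSearch)
                return i;
        }
        return null;
    }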
}

I have a log file, which when read with offsetreader causes it to get into infinite loop... — rekna, Feb 17 '17 at 14:08

score 2 · Answer 4 · answered Jul 15 '12 at 11:20

Though Thomas Levesque's solution works well, here's mine. It uses reflection, so it will be slower, but it's encoding-independent. I also added a Seek extension.

/// <summary>Useful <see cref="StreamReader"/> extensions.</summary>
public static class StreamReaderExtentions
{
    /// <summary>Gets the position within the <see cref="StreamReader.BaseStream"/> of the <see cref="StreamReader"/>.</summary>
    /// <remarks><para>This method is quite slow. It uses reflection to access private <see cref="StreamReader"/> fields. Don't use it too often.</para></remarks>
    /// <param name="streamReader">Source <see cref="StreamReader"/>.</param>
    /// <exception cref="ArgumentNullException">Occurs when passed <see cref="StreamReader"/> is null.</exception>
    /// <returns>The current position of this stream.</returns>
    public static long GetPosition(this StreamReader streamReader)
    {
        if (streamReader == null)
            throw new ArgumentNullException("streamReader");

        var charBuffer = (char[])streamReader.GetType().InvokeMember("charBuffer", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
        var charPos = (int)streamReader.GetType().InvokeMember("charPos", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
        var charLen = (int)streamReader.GetType().InvokeMember("charLen", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);

        var offsetLength = streamReader.CurrentEncoding.GetByteCount(charBuffer, charPos, charLen - charPos);

        return streamReader.BaseStream.Position - offsetLength;
    }

    /// <summary>Sets the position within the <see cref="StreamReader.BaseStream"/> of the <see cref="StreamReader"/>.</summary>
    /// <remarks>
    /// <para><see cref="StreamReader.BaseStream"/> should be seekable.</para>
    /// <para>This method is quite slow. It uses reflection and flushes the charBuffer of the <see cref="StreamReader.BaseStream"/>. Don't use it too often.</para>
    /// </remarks>
    /// <param name="streamReader">Source <see cref="StreamReader"/>.</param>
    /// <param name="position">The point relative to origin from which to begin seeking.</param>
    /// <param name="origin">Specifies the beginning, the end, or the current position as a reference point for origin, using a value of type <see cref="SeekOrigin"/>. </param>
    /// <exception cref="ArgumentNullException">Occurs when passed <see cref="StreamReader"/> is null.</exception>
    /// <exception cref="ArgumentException">Occurs when <see cref="StreamReader.BaseStream"/> is not seekable.</exception>
    /// <returns>The new position in the stream. This position can differ from <paramref name="position"/> because of the preamble.</returns>
    public static long Seek(this StreamReader streamReader, long position, SeekOrigin origin)
    {
        if (streamReader == null)
            throw new ArgumentNullException("streamReader");

        if (!streamReader.BaseStream.CanSeek)
            throw new ArgumentException("Underlying stream should be seekable.", "streamReader");

        var preamble = (byte[])streamReader.GetType().InvokeMember("_preamble", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
        if (preamble.Length > 0 && position < preamble.Length) // preamble or BOM must be skipped
            position += preamble.Length;

        var newPosition = streamReader.BaseStream.Seek(position, origin); // seek
        streamReader.DiscardBufferedData(); // this updates the buffer

        return newPosition;
    }
}
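Once line-start byte offsets are known, looking a line up again needs only public API: seek the base stream, then call DiscardBufferedData so the reader forgets what it buffered at the old position. A minimal sketch, assuming the offset really points at the first byte of a line; OffsetLookup and ReadLineAt are hypothetical names. One caveat that this sketch does not handle: if the file starts with a BOM, a line read from offset 0 after earlier reads may come back with the BOM character attached, which is what the preamble check in the Seek extension above works around.

```csharp
using System;
using System.IO;
using System.Text;

public static class OffsetLookup
{
    // Reads the single line that starts at the given byte offset.
    // Assumes byteOffset points at the first byte of a line in a seekable stream.
    public static string ReadLineAt(StreamReader reader, long byteOffset)
    {
        reader.BaseStream.Seek(byteOffset, SeekOrigin.Begin);
        reader.DiscardBufferedData(); // drop characters buffered from the old position
        return reader.ReadLine();
    }
}
```

Lookups can happen in any order, since each call repositions the stream before reading.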

score -1 · Answer 5 · answered Apr 07 '10 at 16:35

Would this work:

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = 0;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos += line.Length;
  }
}

Sani Huttunen

Unfortunately not, because I have to accept different types of newlines (think this \n, \r\n, \r) and the number would be skewed. This might work if I insist to have a _consistent_ newline separator (it could very well be mixed in practice) and if I probe it first, to know the real offset. So - I'm trying to avoid going down that route. – Benjamin Podszun Apr 07 '10 at 16:38
@Benjamin: Darn - I just posted a similar answer which explicitly relied on a consistent newline separator... – Jon Skeet Apr 07 '10 at 16:40
Then I think you'd be better off doing it manually with StreamReader.Read(). – Sani Huttunen Apr 07 '10 at 16:42
@Jon: Hehe. As I said: That _might_ be the way, instead of using a plain Stream - if these are the only two options I've to roll a dice and live with the consequences: Either the consistent separators (bad for files that were processed on more than one platform, copy/pasted in bad editors etc) or the Stream stuff (boring low level line parsing and string encoding mess, a lot of boiler plate code for a seemingly low return) – Benjamin Podszun Apr 07 '10 at 16:44
That wouldn't help much. I have to ditch the whole `StreamReader`. Even `Read()` on it leads to a block read on the underlying stream and moves the `BaseStream.Position` to 25 for my sample. After _one char_. – Benjamin Podszun Apr 07 '10 at 16:47
