3

I need to read all but the last x lines of a file into a StreamReader in C#. What is the best way to do this?

Many Thanks!

  • 2
    Read from the end of the file for x newlines, then read from the beginning of the file until that position (a sketch of this two-pass idea follows these comments). – M.Babcock Mar 09 '12 at 03:22
  • Is there some kind of uniformity to the records you plan to read from your file (common record length, anything other than ending in `\n`)? – M.Babcock Mar 09 '12 at 03:37
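
Below is a minimal sketch of the two-pass idea from the first comment, added here for illustration; it is not code posted by anyone in the thread. To keep the StreamReader side simple it counts lines in the first pass instead of recording a byte position, and the class/method names, the buffer size, and the assumption of an ASCII-compatible encoding (UTF-8, Windows-1252, ...) where '\n' ends every line are all illustrative.

using System.Collections.Generic;
using System.IO;

static class AllButLastLines
{
    // Pass 1: count lines with a cheap byte scan (no string decoding)
    private static long CountLines(string path)
    {
        long count = 0;
        bool endedWithNewline = true;

        using (FileStream fs = File.OpenRead(path))
        {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] == (byte)'\n')
                        count++;
                }
                endedWithNewline = buffer[read - 1] == (byte)'\n';
            }
        }

        // A file that does not end in '\n' still has a final, unterminated line
        if (!endedWithNewline)
            count++;
        return count;
    }

    // Pass 2: stream the file with a StreamReader and stop x lines early
    public static IEnumerable<string> Read(string path, int skipLastLines)
    {
        long linesToKeep = CountLines(path) - skipLastLines;

        using (StreamReader reader = new StreamReader(path))
        {
            long kept = 0;
            string line;
            while (kept < linesToKeep && (line = reader.ReadLine()) != null)
            {
                yield return line;
                kept++;
            }
        }
    }
}

Usage might look like `foreach (string line in AllButLastLines.Read(@"C:\data\input.csv", 5)) { /* validate and load */ }` (the path is, of course, made up).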

3 Answers

4

If it's a large file, is it possible to just seek to the end of the file and examine the bytes in reverse for the '\n' character? I am aware that both \n and \r\n exist. I whipped up the following code and tested it on a fairly trivial file. Can you try testing this on the files that you have? I know my solution looks long, but I think you'll find it's faster than reading from the beginning and rewriting the whole file.

public static void Truncate(string file, int linesToTruncate)
{
    using (FileStream fs = File.Open(file, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None))
    {
        fs.Position = fs.Length;

        // Both \n and \r\n line endings end with \n, so scanning for \n alone is enough
        const int BUFFER_SIZE = 2048;

        // Start at the end until # lines have been encountered, record the position, then truncate the file
        long currentPosition = fs.Position;
        int linesProcessed = 0;

        byte[] buffer = new byte[BUFFER_SIZE];
        while (linesProcessed < linesToTruncate && currentPosition > 0)
        {
            int bytesRead = FillBuffer(buffer, fs);

            // We now have a buffer containing the later contents of the file
            for (int i = bytesRead - 1; i >= 0; i--)
            {
                 currentPosition--;
                 if (buffer[i] == '\n')
                 {
                     linesProcessed++;
                     if (linesProcessed == linesToTruncate)
                         break;
                 }
            }
        }

        // Truncate the file
        fs.SetLength(currentPosition);
    }
}

private static int FillBuffer(byte[] buffer, FileStream fs)
{
    if (fs.Position == 0)
        return 0;

    int bytesRead = 0;
    int currentByteOffset = 0;

    // Calculate how many bytes of the buffer can be filled (remember that we're going in reverse)
    long expectedBytesToRead = (fs.Position < buffer.Length) ? fs.Position : buffer.Length;
    fs.Position -= expectedBytesToRead;

    while (bytesRead < expectedBytesToRead)
    {
        // Read() may return fewer bytes than requested, so accumulate until the window is filled
        int read = fs.Read(buffer, currentByteOffset, (int)expectedBytesToRead - bytesRead);
        if (read == 0)
            break;
        bytesRead += read;
        currentByteOffset += read;
    }

    // We have to reset the position again because the reads moved the stream forward.
    fs.Position -= bytesRead;
    return bytesRead;
}

Since you are only planning on deleting the end of the file, it seems wasteful to rewrite everything, especially if it's a large file and N is small. Of course, one can make the argument that if someone wanted to eliminate nearly all of the lines, then going from the beginning to the end is more efficient.
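
For what it's worth, a call site might look like the line below (an illustrative addition, with a made-up path). Note that this method truncates the file on disk rather than just reading past the unwanted lines.

// Removes the last 5 lines of a (hypothetical) CSV file in place
Truncate(@"C:\temp\data.csv", 5);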

Tung
3

You don't really read INTO a StreamReader. In fact, for the pattern you're asking about, you don't need a StreamReader at all. System.IO.File has the useful static method 'ReadLines' that you can leverage instead:

IEnumerable<string> allBut = File.ReadLines(path).Reverse().Skip(5).Reverse();
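
A quick usage note, added here rather than part of xcud's answer: the query above is deferred, so nothing is read from disk until it is enumerated, at which point Reverse() buffers the entire sequence in memory.

// Nothing is read until the query is enumerated; Reverse() then buffers every line in memory
foreach (string line in allBut)
    Console.WriteLine(line);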

The previous, flawed version, shown again here in response to the comment thread:

List<string> allLines = File.ReadLines(path).ToList();
IEnumerable<string> allBut = allLines.Take(allLines.Count - 5);
xcud
  • You're proposing reading the entire file as a performant alternative (ReadLines.Count _will_ read the entire file)? – M.Babcock Mar 09 '12 at 03:34
  • 1
    You're right. I just ran it through a few timed tests. The second method is consistently faster. Thanks. Updating my answer to remove the first option. – xcud Mar 09 '12 at 03:47
  • +1 - Though it could become more performant by processing it yourself byte by byte in reverse while looking for `Chr(13)` - 1 byte but this should still be faster than the alternatives. – M.Babcock Mar 09 '12 at 03:58
  • I just spent the past half hour wrapping up one method after another inside of a System.Diagnostics.Stopwatch block. Nothing has come anywhere close to the performance of this one-liner. I'm pleasantly surprised. – xcud Mar 09 '12 at 04:23
  • 1
    That would depend on the file length. If the file does not fit in your free memory, you are toast. – Andrew Savinykh Mar 09 '12 at 04:31
  • There's a much simpler way to do this; there's no need to reverse the sequence and then reverse it back. In fact, it's *similar* to what you had in a prior version of the answer (although that one, too, was flawed). Use File.ReadAllLines to get an array, then *take* the first *length - n* elements of the array (see the sketch after this comment thread). – Anthony Pegram Mar 09 '12 at 04:44
  • The flawed part of the prior version was using the method that returned IEnumerable, so that Count had to walk the sequence in its entirety multiple times. The next flawed part was Math.Min (which used Count twice). If I have a positive integer y, when is x - y going to be greater than x? Barring an integer underflow, *never*. So that particular piece of logic was overkill! Had you simply taken Count() - 5, you would have been better off. – Anthony Pegram Mar 09 '12 at 04:47
  • 1
    We need to know from the noticeably absent asker whether he intends to read the file (minus the last 5 lines) INTO something, in which case we're already saddled with the memory requirement and this is a darn fine answer in that scenario, OR he wants to process the file line by line (except your 5 skips, of course), in which case M.Babcock's original comment is the most correct solution: set a marker by reading backwards, then process front to back until you hit the marker. – xcud Mar 09 '12 at 04:49
  • 1
    I rather doubt you are running an effective test. Without questioning whether or not you are running in release mode without a debugger attached, etc., I will start by asking whether you are actually materializing the query results. For example, `allBut = File.ReadLines(path).Reverse().Skip(5).Reverse();` *doesn't do anything* until you execute it (i.e., *iterate over it*). – Anthony Pegram Mar 09 '12 at 05:15
  • That's reasonable criticism. What do you suggest as a test harness? – xcud Mar 09 '12 at 05:25
  • Give http://pastebin.com/sLEVdwyj a try. Play with it, improve it; I am by no means an expert on performance testing. My quick, inexpert testing has revealed "my" version to be faster for the given number of iterations I ran and the file size. You may find differently. And tossing performance aside for a moment, which version do you find easier to understand? Granted, neither is altogether *hard*, but just at a glance? – Anthony Pegram Mar 09 '12 at 05:31
  • Apologies for not responding earlier; for some reason I didn't get any notifications of the responses. My goal is to read in the file (a CSV file) minus the last x lines and then load that data into a database, after validation and all that malarkey. I think the process of reading backwards to find a marker and then reading forward will be the way to go, but I really appreciate all the input here. – John Griffiths Mar 09 '12 at 05:44
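
As referenced in Anthony Pegram's comment above, there is also the File.ReadAllLines variant; here is a short sketch of it, added for illustration rather than taken from the thread. It is the simplest option, but it reads the entire file into memory, so it only suits files that comfortably fit in RAM, and Take requires a using directive for System.Linq.

// Read every line into an array, then keep all but the last 5
string[] allLines = File.ReadAllLines(path);
IEnumerable<string> allButLast = allLines.Take(allLines.Length - 5);
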
3

Since you are referring to lines in a file, I'm assuming it's a text file. If you just want to get the lines, you can read them into an array of strings like so:

string[] lines = File.ReadAllLines(@"C:\test.txt");

Or if you really need to work with StreamReaders:

using (StreamReader reader = new StreamReader(@"C:\test.txt"))
{
    while (!reader.EndOfStream)
    {
        Console.WriteLine(reader.ReadLine());
    }
}
BryanJ
  • The use of `StreamReader` should have been enough to assume the OP was talking about text. – M.Babcock Mar 09 '12 at 03:32
  • I guess I wasn't sure if the person asking the question knew they needed a StreamReader, or just knew they needed to read in a file, did a quick search online, and saw StreamReader show up. But yes, you are correct. – BryanJ Mar 09 '12 at 04:37
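
As a closing sketch, added for illustration and not taken from any answer above: BryanJ's StreamReader loop can be adapted to drop the last x lines in a single forward pass by holding x lines back in a queue. The skipLast value and the path are illustrative, and Queue<string> needs System.Collections.Generic.

using (StreamReader reader = new StreamReader(@"C:\test.txt"))
{
    int skipLast = 5;                         // x, for illustration
    Queue<string> pending = new Queue<string>();

    string line;
    while ((line = reader.ReadLine()) != null)
    {
        pending.Enqueue(line);
        // Only emit a line once we know at least skipLast lines still follow it
        if (pending.Count > skipLast)
            Console.WriteLine(pending.Dequeue());
    }
    // Whatever remains in 'pending' is the last skipLast lines (or fewer), which we drop
}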