I'm writing a file parser in a .NET application that reads the file with a StreamReader. The file to be parsed starts with a header that ends with "<eoh>". I want to either read or ignore everything from the start until that string. The actual data starts after that.

The file is not line-based. Everything is separated only by such marker strings, so I cannot use ReadLine.

How can I do that without reading one character at a time and implementing a state machine to recognise the marker's characters? I'm specifically looking for a method like StreamReader.SkipUntilAfter(string) or StreamReader.ReadUntil(string).

Oh, and this project is still using .NET 2.0, so LINQ is not desired here. Although I could probably resolve that if somebody suggests using it.

ygoe
  • If the file is line-based, you can use `File.ReadLines` and LINQ (e.g. `SkipWhile`, `TakeWhile`). – Tim Schmelter Oct 19 '14 at 19:53
  • You can use something like while((line=reader.ReadLine()) != null) { // ONLY READ AFTER line.Equals("") } – briba Oct 19 '14 at 19:54
  • Updated question: It's not line-based. The marker could occur in the middle of a line. Then I'd already have read part of the data. And StreamReader cannot seek back or seek anywhere. – ygoe Oct 19 '14 at 19:57
  • Indeed. I don't think `StreamReader` is the good candidate for the job. There is a reason why we use it with `ReadLines` most of the time. – Patrice Gahide Oct 19 '14 at 20:04
  • What other options are there? I need to continue reading characters based on the content. Specifically I'm reading ADIF v2 files: http://www.adif.org/ – ygoe Oct 19 '14 at 20:08
  • Can't you read the whole file at once and then process it? – brz Oct 19 '14 at 20:16
  • Skipping data doesn't make sense, you should reconsider the format. – toplel32 Oct 19 '14 at 20:49
  • Your request is very sensible. It is an arbitrary limitation that you can only read up to a *line break*. Why not something else? Fork the StreamReader source code. It is quite understandable. – usr Oct 19 '14 at 20:51

1 Answer

TextReaders already read character by character under the hood. They use a buffer to make that fast, but scanning the StreamReader's buffer is no different from reading ahead yourself and consuming input only up to the <eoh>. For the same reason, there is no better way to skip past the header. At best, a built-in method would simply wrap this same loop, so it wouldn't buy you much.

In case you don't believe me for whatever reason, here's the source code.

Also, it's worth noting that you'll have to look character-by-character no matter what. Even if you had a way of pulling them into memory without doing so, comparing two strings is a character-by-character operation. So you wouldn't be saving anything.

Personally, I'd just go with something like this. It takes a TextReader and end-of-header string, and reads through the reader until it finds eoh. It then returns a bool for whether it found the marker or not.

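public bool SkipUntilAfterHeader(TextReader reader, string eoh)
{
    // Number of leading characters of eoh matched so far.
    int eohGuessIndex = 0;
    int next;

    // Read() returns -1 at the end of the stream.
    while ((next = reader.Read()) != -1)
    {
        char c = (char)next;

        if (c == eoh[eohGuessIndex])
        {
            eohGuessIndex++;
            if (eohGuessIndex == eoh.Length)
            {
                // The reader is now positioned just after the marker.
                return true;
            }
        }
        else
        {
            // The mismatched character may itself begin the marker
            // (e.g. "<<eoh>"). This simple restart is enough for markers
            // like "<eoh>" whose first character does not recur inside them.
            eohGuessIndex = (c == eoh[0]) ? 1 : 0;
        }
    }

    return false;
}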

I'm not sure what .NET 2.0 did or didn't include, so I wrote this from scratch even though a built-in might cover part of it. Performance shouldn't be affected by that. A nice aspect of this approach is that you could easily add a StringBuilder and an out parameter to hand back the header text, in case you want it later on.
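As a rough sketch of that StringBuilder idea, the variant below collects everything before the marker and returns it through an out parameter. The names HeaderReader and ReadUntilAfter are made up for illustration, and the code sticks to C# 2.0 syntax on the assumption that the project cannot use newer language features. It uses a simple restart on mismatch, so it is only reliable for markers such as "<eoh>" whose first character does not recur inside them.

```csharp
using System;
using System.IO;
using System.Text;

public static class HeaderReader
{
    // Hypothetical helper, not part of the .NET framework.
    // Reads up to and including the marker. On success, header holds the
    // text before the marker and the reader is positioned just after it.
    // On failure, header holds everything that was read.
    public static bool ReadUntilAfter(TextReader reader, string marker, out string header)
    {
        if (string.IsNullOrEmpty(marker))
        {
            throw new ArgumentException("marker must not be empty", "marker");
        }

        StringBuilder text = new StringBuilder();
        int matched = 0; // leading marker characters matched so far
        int next;

        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            text.Append(c);

            if (c == marker[matched])
            {
                matched++;
                if (matched == marker.Length)
                {
                    // Drop the marker itself from the captured text.
                    text.Length -= marker.Length;
                    header = text.ToString();
                    return true;
                }
            }
            else
            {
                // The mismatched character may itself start the marker.
                matched = (c == marker[0]) ? 1 : 0;
            }
        }

        header = text.ToString();
        return false;
    }
}
```

Calling HeaderReader.ReadUntilAfter(reader, "<eoh>", out header) leaves the reader at the first character of the data, just as the skip-only version does.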

Then, usage is pretty simple.

public void ReadFile(string path)
{
    using (StreamReader reader = new StreamReader(path))
    {
        if (SkipUntilAfterHeader(reader, "<eoh>"))
        {
            // read file
        }
        else
        {
            // corrupt file
        }
    }
}

But, realistically, it might just be easier to read the whole file into memory and return only the relevant part. It depends on how much performance and memory use matter compared to readability.
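A minimal sketch of that read-everything approach might look like the following. WholeFileReader, TextAfterMarker and ReadDataSection are hypothetical names; File.ReadAllText and the String.IndexOf overload taking a StringComparison are both available in .NET 2.0. It assumes the file is small enough to hold in memory and that the default encoding detection of File.ReadAllText suits the file.

```csharp
using System;
using System.IO;

public static class WholeFileReader
{
    // Hypothetical helper: returns the text after the first occurrence
    // of marker, or null if the marker does not occur in content.
    public static string TextAfterMarker(string content, string marker)
    {
        int index = content.IndexOf(marker, StringComparison.Ordinal);
        if (index < 0)
        {
            return null;
        }
        return content.Substring(index + marker.Length);
    }

    // Loads the whole file into memory, then discards the header.
    public static string ReadDataSection(string path, string marker)
    {
        return TextAfterMarker(File.ReadAllText(path), marker);
    }
}
```

The comparison here is ordinal and case-sensitive, so a file that writes the marker in a different case would need a different StringComparison value.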

And in classically bad form, note that I haven't tested--or even compiled--any of this. But it should be relatively easy to fix, even if it doesn't work.

Matthew Haugen
  • This will work but reading char-by-char is much more CPU intensive than what StreamReader does internally. – usr Oct 19 '14 at 20:52
  • @usr no, actually it isn't. Look at the Reference Source for its implementation of [`Read`](http://referencesource.microsoft.com/mscorlib/a.html#5d81175d2e6d320e) and its implementation for [`ReadLine`](http://referencesource.microsoft.com/mscorlib/a.html#a4ada5f765646068). `ReadLine` is effectively doing the same thing as I am, it's just a little more optimized. But nothing you couldn't do here. They both make use of the buffer internally, so it's mostly just memory reads. The `List<>` could easily be made better, but I elected for the more expandable option. – Matthew Haugen Oct 19 '14 at 20:57
  • Your hot loop is far less tight than StreamReader can do because it can use its internal buffer directly. Calling Read for every char is something you're going to notice in benchmarks. Also the list adding and removing adds overhead. I'd say yours is >=3x slower than the native version. – usr Oct 19 '14 at 21:00
  • @usr Fair enough. You've got me there. I guess this is more of just an example of what it should look like. I don't think there's any viable way of escaping the `Read` calls without reading the whole thing at once, which is what the OP was looking to avoid. The `List<>` can definitely be made a ton better with just an index that's persisted so comparisons only have to happen once. But this is a quick and dirty way of showing kind of what needs to happen. – Matthew Haugen Oct 19 '14 at 21:04
  • @usr Talked me into it. I realized that the efficient way was actually a fair bit simpler. It should be quite a bit better now. Nothing compared to what a native solution that has a built-in idea of what the end-of-header string is, but it's something. – Matthew Haugen Oct 19 '14 at 21:10
  • I think your solution is fine in principle. (Except that now it fails to match for "<".) – usr Oct 19 '14 at 21:11
  • @usr Yeah, that's definitely what it was intended to be--just a proof of concept to add onto the fact that a char-by-char comparison was probably going to be the only viable option. – Matthew Haugen Oct 19 '14 at 21:14
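
One possible middle ground for the per-character `Read` overhead discussed in these comments is to pull characters in blocks with `TextReader.Read(char[], int, int)` and scan each block in a plain loop. The sketch below is untested against real ADIF files and not benchmarked; ChunkedSkipper, SkipUntilAfter and the chunkSize parameter are names invented for illustration. Because a block usually extends past the marker, the characters that follow the marker in the last block are handed back through an out parameter, and the caller has to process them before reading further from the reader. As with the simple loop in the answer, the restart logic assumes a marker whose first character does not recur inside it.

```csharp
using System;
using System.IO;

public static class ChunkedSkipper
{
    // Hypothetical helper: reads blocks of chunkSize characters until the
    // marker has been seen. leftover receives the characters of the final
    // block that follow the marker; they are already consumed from reader.
    public static bool SkipUntilAfter(TextReader reader, string marker,
                                      int chunkSize, out string leftover)
    {
        if (string.IsNullOrEmpty(marker))
        {
            throw new ArgumentException("marker must not be empty", "marker");
        }

        char[] buffer = new char[chunkSize];
        int matched = 0; // carried across blocks so a split marker is found
        int count;

        // Read may return fewer characters than requested; 0 means end.
        while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < count; i++)
            {
                char c = buffer[i];
                if (c == marker[matched])
                {
                    matched++;
                    if (matched == marker.Length)
                    {
                        leftover = new string(buffer, i + 1, count - i - 1);
                        return true;
                    }
                }
                else
                {
                    // The mismatched character may itself start the marker.
                    matched = (c == marker[0]) ? 1 : 0;
                }
            }
        }

        leftover = string.Empty;
        return false;
    }
}
```

After a successful call, the data section is leftover followed by whatever the reader still yields.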