13

Scenario - 150MB text file which is the exported Inbox of an old email account. Need to parse through and pull out emails from a specific user and writes these to a new, single file. I have code that works, its just dogged slow.

I'm using marker strings to search for where to begin/end the copy from the original file.

Here's the main function:

 StreamReader sr = new StreamReader("c:\\Thunderbird_Inbox.txt");
        string working = string.Empty;
        string mystring = string.Empty;
        while (!sr.EndOfStream)
        {
            while ((mystring = sr.ReadLine()) != null)
            {
                if (mystring == strBeginMarker)
                {
                    writeLog(mystring);

                    //read the next line
                    working = sr.ReadLine();

                        while( !(working.StartsWith(strEndMarker)))
                        {
                            writeLog(working);
                            working = sr.ReadLine();

                        }
                  }
            }

        }
        this.Text = "DONE!!";
        sr.Close();

The function that writes the selected messages to the new file:

  public void writeLog(string sMessage)
    {
            fw = new System.IO.StreamWriter(path, true);
            fw.WriteLine(sMessage);
            fw.Flush();
            fw.Close();
    }

Again, this process works. I get a good output file, it just takes a long time and I'm sure there are ways to make this faster.

erbridge
  • 1,376
  • 12
  • 27
paparush
  • 1,340
  • 1
  • 17
  • 25
  • BTW - You might want to consider the using statement instead of Close() manually - it's safer in case you hit an exception. My example demonstrates... – Reed Copsey Jan 20 '11 at 19:19
  • 1
    `while (!sr.EndOfStream)` is redundant with `while ((mystring = sr.ReadLine()) != null)` – Kirk Woll Jan 20 '11 at 19:20

5 Answers5

19

The largest optimization would be to change your writeLog method to open the file once at the beginning of this operation, write to it many times, then close it at the end.

Right now, you're opening and closing the file each iteration where you write, which is going to definitely slow things down.

Try the following:

// Open this once at the beginning!
using(fw = new System.IO.StreamWriter(path, true))
{
    using(StreamReader sr = new StreamReader("c:\\Thunderbird_Inbox.txt"))
    {
        string working;
        string mystring;
        while ((mystring = sr.ReadLine()) != null)
        {
           if (mystring == strBeginMarker)
           {
                writeLog(mystring);

                //read the next line
                working = sr.ReadLine();

                while( !(working.StartsWith(strEndMarker)))
                {
                    fw.WriteLine(working);
                    working = sr.ReadLine();
                }
            }
        }
    }
}
this.Text = "DONE!!";
Reed Copsey
  • 554,122
  • 78
  • 1,158
  • 1,373
2

I think you should:

  1. Open files once.
  2. Load source file in memory.
  3. Break it and use several threads for processing.
acoolaum
  • 2,132
  • 2
  • 15
  • 24
  • 1
    While I like this answer, in theory - in practice, it will probably not help much. The OP is still likely to be completely IO bound on the output (since it's being written to a single output file), so multithreading will likely yield very little benefit. I don't know if it'd be worth the complexity. – Reed Copsey Jan 20 '11 at 19:42
  • I agree with Reed here. How would you break it into threads? You'd still need to have logic to do so. If you broke it into chunks of equal size, you'd need to handle the case where a single message starts in one chunk and ends in another. If you did logic to break it at a marker boundary, you are pre-parsing before adding the overhead of threading, probably making it *less* efficient. That's not to say it couldn't be done - it's just much more work than it seems worth. – Wonko the Sane Jan 20 '11 at 20:05
  • @Wonko the Sane, I think the simplest way is first - break it for equal parts and find begining of the first message in each, second - start thread proccessing from first message of each section. – acoolaum Jan 20 '11 at 20:24
  • @acoolaum - but, you aren't handling the case where the break is mid-message (the start marker is in chunk #1, the end marker is in chunk #2). Plus, there is no "processing" needed in this case - the task is, essentially, to take a number of substrings out of a string and plop them in another string. – Wonko the Sane Jan 20 '11 at 20:40
  • @Wonko the Sane, the task is clear for me. In first step you have 2 parts: placing pointers of parts to calculated initial position then move pointer of each part to first message in this part. – acoolaum Jan 20 '11 at 21:00
  • @acoolaum - but what if your calculated pointer (I assume you mean a block of size n) falls in the middle of a message? – Wonko the Sane Jan 20 '11 at 21:08
  • @Wonko the Sane, For example, if initial pointer of "section1" falls in the middle of message I move it to the end of this message and increase "section0" lenght before second phase (processing). – acoolaum Jan 20 '11 at 22:02
  • Right - that's my point. :) In order to break the file into appropriate pieces, you have to parse it somehow. You need to know you are in the middle of the message. However, the entire task in this case *is* to parse the file. No further processing on those chunks is necessary. – Wonko the Sane Jan 21 '11 at 13:52
2

I would just do a simple parser. Note that this assumes (as you do in your code above) that the markers are in fact unique.

You may have to play with the formatting a bit of your output, but here is the general idea:

   // Read the entire file and close it
   using (StreamReader sr = new
   StreamReader("c:\\Thunderbird_Inbox.txt");)
   {
       string data = sr.ReadToEnd();   
   }

   string newData = "";   
   int position = data.IndexOf(strBeginMarker);

   while (position > 0)   
   {
      int endPosition = data.IndexOf(endMarker, position);
      int markerLength = position + strBeginMarker.Length;

     newData += data.Substring(markerLength, endPosition - markerLength);

     position = data.IndexOf(strBeginMarker, position+ endStr.Length);   
   }

  writeLog(newData);

(Note that I don't have a 150 MB file to test this on - YMMV depending on the machine you are using).

Wonko the Sane
  • 10,623
  • 8
  • 67
  • 92
0

I do not have a 150MB text file to test, but if your server has the memory would Reading the hold thing into a string and doing a RegEx pulling out the message work?

Mike
  • 5,918
  • 9
  • 57
  • 94
  • 1
    RegEx would work, but they can be fairly complex and difficult to get "just right." The parser solution I gave is essentially the same idea, but simpler for an less experienced coder. – Wonko the Sane Jan 20 '11 at 19:48
0

You could simply declare the StreamWriter object outside of that while loop and just write the line to it inside the loop.

Like this:

StreamWriter sw = new StreamWriter(path, true);
while
{
    // ...
    while( !(working.StartsWith(strEndMarker)))
    {
        sw.WriteLine(working);
        working = sr.ReadLine();
    }
}
Smur
  • 3,075
  • 7
  • 28
  • 46