3

I am working on a C# program that reads in very large files and is checking them for different attributes and fields. I had been testing with files with under 1 million lines and it was preforming as expected. I have recently tested it on a file with 2.5 million lines and it took 4 hours to run through.

I am using a custom Reading function to read in each character so that I can find all CR and LF because it is very important that every line contains them. I have tested the Reading function separately and it look about 14 minutes to read the file, which I find reasonable enough to read every character in a 2.5 million lines with 1500 characters. I will included me Reading function, however this doesn't seem to be causing the issue.

My reading function adds each character to a string and then I check different values in the string. For example, is line length is correct, does file contains a header, and does the header contain the correct values. As well as specific values like is char position 403-404 a number, is field 1250-1300 not null, etc.

My question is what can I do to figure out what is causing the slow down and increase my efficiency of my program? I have tried checking the time at the beginning and end of each line loop and it doesn't seem to change. However, every 100,000 takes significantly longer than the previous. As an example, processing line 10,000 to 20,000 took less than 3 seconds and 830,000 to 840,000 took about 35 seconds. I have considered trying to multiple threads but don't think it will help in my case with reading lines from a file. Thoughts? Thanks for the help!

    static void ReadMyLine(ref string currentLine, string filePath, ref int asciiValue, ref Boolean isMissingCR, ref Boolean isMissingLF, ref Boolean isReversed, ref StreamReader file)
    {
        Boolean endOfRow = false;
        isMissingCR = false;
        isMissingLF = false;
        isReversed = false;

        currentLine = "";

        while (endOfRow == false)
        {
            asciiValue = file.Read();

            if (asciiValue == 10 || asciiValue == 13)
            {
                int asciiValueTemp = file.Peek();

                if (asciiValue == 13 && asciiValueTemp == 10)
                {
                    endOfRow = true;
                    asciiValue = file.Read();
                }
                else if (asciiValue == 10 && asciiValueTemp == 13)   // CRLF Reversed
                {
                    asciiValue = file.Read();
                    endOfRow = true;
                    isReversed = true;
                }
                else if (asciiValue == 10)                           // Missing CR
                {
                    isMissingCR = true;
                    endOfRow = true;
                }
                else if (asciiValue == 13)                           // Missing LF
                {
                    isMissingLF = true;
                    endOfRow = true;
                }
                else
                    endOfRow = true;
            }
            else if (asciiValue != -1)
                currentLine += char.ConvertFromUtf32(asciiValue);
            else
                endOfRow = true;
        }
    }
buzzzzjay
  • 1,140
  • 6
  • 27
  • 54

1 Answers1

11

Here's the first thing I looked for, and the first thing I'd change:

currentLine += char.ConvertFromUtf32(asciiValue);

Don't do that. Using string concatenation in a loop can kill performance - you get O(N2) complexity. Use a StringBuilder instead. See my article on when to use StringBuilder for more explanation.

There may well be more you can do, but just changing to use StringBuilder will be a huge improvement:

StringBuilder builder = new StringBuilder();
while (...)
{
    ...
    builder.Append(char.ConvertFromUtf32(asciiValue));
}
currentLine = builder.ToString();

It's also unclear why you have so many ref parameters. Why are you passing in asciiValue at all? Why are you passing the StreamReader by reference? Anything using this many ref parameters makes me very nervous - why don't you have a type which encapsulates everything you really want to return from the method?

You may want to read my article on parameter passing to get a better understanding of ref.

Christophe Geers
  • 8,564
  • 3
  • 37
  • 53
Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 1
    I agree with Jon. Each string you concatenate is actually a copy of the string before it re-created in memory and referenced on the heap - having a lot of these will certainly create an application performance hit. – dooburt Sep 09 '11 at 14:25
  • @Jon That kind of signature with a static method and all those ref parameters is often caused by selecting a bunch of code in the middle of a method and using Visual Studio's "Extract Method" option. It is not very intelligent. – Ben Robinson Sep 09 '11 at 14:32
  • I was using the ref parameters to pass values out. I don't pass anything in with asciiValue, but do want the last value stored in it so I pass it out using a ref parameter? Do you have a recommendation on a way to do it better? – buzzzzjay Sep 09 '11 at 14:44
  • 2
    @buzzzzjay: You're never changing the value of `file`, so why is that `ref`? If you only want the value of `asciiValue` *out*, you should use an `out` parameter - or a return value! Having a void method with ref parameters is really ugly. But it looks like really the method wants to return several pieces of information: so declare a type to encapsulate that information, and make that the return type of the method. – Jon Skeet Sep 09 '11 at 14:49
  • @Mathieu: Link works for me. If pobox.com is blocked for you, try http://www.yoda.arachsys.com/csharp/parameters.html – Jon Skeet Sep 09 '11 at 14:49
  • Changing to a StringBuilder increased 10,000-20,000 run time to 1.08 seconds from 3 seconds. I didn't realize sting concatenation was that bad, and caused such a slow down. – buzzzzjay Sep 09 '11 at 14:50
  • @buzzzzjay: It's really important that you understand *why* it was happening. Read the article I linked to. – Jon Skeet Sep 09 '11 at 14:52
  • @Jon Skeet: I believe the reason I passed the file as a ref was because as the file is read it consumes the information and if it is not reference it kept re-reading the same line. – buzzzzjay Sep 09 '11 at 14:53
  • 1
    @buzzzzjay: That suggests you don't understand how parameter passing works in C# when it comes to reference types such as `StreamReader`. Read the second link as well... – Jon Skeet Sep 09 '11 at 14:54
  • @Joh Skeet: I have read it and don't see how else I could handle StreamReader. Any sugestions? – buzzzzjay Sep 09 '11 at 15:52
  • 1
    @buzzzzjay: Yes - pass it *not* by reference. You're only passing the *reference*, not the object itself. Changes made to the object due to reading from it will still be visible to the caller. I suggest you read the article again, really carefully - particularly the example where a `StringBuilder` reference is passed by value, and then " world" is appended to it. – Jon Skeet Sep 09 '11 at 16:05
  • @Jon Skeet: don't know what I did previously but it did work even though I wasn't passing by reference. Thanks for the help, this did increase run time of every 10000 lines slightly. – buzzzzjay Sep 09 '11 at 17:23