5

I am working on a C# program to determine the line length for each row in multiple large text files with 100,000+ rows before importing them with an SSIS package. I will also be checking other values on each line to verify they are correct before importing them into my database using SSIS.

For example, I am expecting a line length of 3000 characters, then a CR at position 3001 and an LF at position 3002, so a total of 3002 characters overall.

When using ReadLine(), it treats a CR or LF as an end of line, so I can't check the CR or LF characters. I had been checking that the line length was 3000 to determine whether it was correct. I have just encountered an issue where a file has an LF at position 3001 but is missing the CR. ReadLine() reports the line as 3000 characters, which looks correct, but it will fail in my SSIS package because it is missing a CR.

I have verified that Read() will read each char one at a time and that I can determine whether each line has a CR and LF, but this seems rather unproductive, and since some files I encounter will have upwards of 5,000,000+ rows, this seems very inefficient. I would also need to append each char to a string, or use ReadBlock() and convert a char array into a string, so that I can check the other values in the line.

Does anyone have any ideas for an efficient way to check each line for the CR and LF, and for the other values on a given line, without wasting unnecessary resources, so that the check finishes in a relatively timely manner?

John Saunders
  • 160,644
  • 26
  • 247
  • 397
buzzzzjay
  • 1,140
  • 6
  • 27
  • 54
  • *Note: I tried using Peek() with ReadLine() and it starts reading the next row instead of reading the CR and LF. I was hoping this would have been an easy solution. It appears that once ReadLine() is used it removes the CR and LF from the StreamReader – buzzzzjay Sep 01 '11 at 21:56
  • 1
    For clarification: although you're checking for valid data, is the import done on the original raw file, or from the data you've already loaded into your C# program? I'm guessing the former, but wanted to be sure. – JaredReisinger Sep 01 '11 at 22:14
  • 1
    It *seems* inefficient? File reading is throttled by the hard disk or network speed. You can use StreamReader.Read(char[], int, int) to read a bunch of characters. – Hans Passant Sep 01 '11 at 22:15
  • @JaredReisinger I am doing the import on the raw data file. I am trying to do a "pre check" on the data to keep the import process from failing because I am sent a file with bad data. – buzzzzjay Sep 01 '11 at 22:28

5 Answers

4

I have verified that Read() will read each char one at a time and I can determine if each line has a CR and LF but this seems rather unproductive

Think about this. Do you think ReadLine() has a magic wand and does not have to read each char?

Just create your own ReadMyLine(). Something has to read the chars; it doesn't matter whether that's your code or the library. I/O will be buffered by the Stream and by Windows.
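A minimal sketch of such a ReadMyLine() might look like the following. The method name, the `out` parameter, and the CR-then-LF rule are assumptions drawn from the question's expected format; the StreamReader's internal buffering keeps the per-char Read() calls cheap.

```csharp
using System.IO;
using System.Text;

static class LineChecker
{
    // Reads one line, consuming its terminator, and reports whether the
    // line ended with a proper CR+LF pair. Returns null at end of file.
    // Sketch only: name and terminator rule are taken from the question.
    public static string ReadMyLine(TextReader reader, out bool crlfOk)
    {
        var sb = new StringBuilder(3000);
        crlfOk = false;
        int ch;
        while ((ch = reader.Read()) != -1)
        {
            if (ch == '\r')
            {
                // Valid terminator: CR immediately followed by LF.
                crlfOk = reader.Peek() == '\n';
                if (crlfOk) reader.Read();   // consume the LF
                return sb.ToString();
            }
            if (ch == '\n')
            {
                // Bare LF: the line ended, but the CR was missing.
                return sb.ToString();
            }
            sb.Append((char)ch);
        }
        // End of file: return any trailing unterminated data, else null.
        return sb.Length > 0 ? sb.ToString() : null;
    }
}
```

The caller can then check both `crlfOk` and `line.Length == 3000` per row, which is the pair of checks the question asks for.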

Ian G
  • 29,468
  • 21
  • 78
  • 92
H H
  • 263,252
  • 30
  • 330
  • 514
I am not against creating my own code, but efficiency is very important when checking 100,000s of lines. I just can't believe this is something others haven't already encountered, or that there isn't a built-in function for it. Do you have any suggestions on how to go about creating ReadMyLine()? – buzzzzjay Sep 01 '11 at 22:26
I have mostly created a "ReadMyLine()". Please see above. Thanks! – buzzzzjay Sep 02 '11 at 15:31
1

I may be missing something here, but isn't the data in each line always exactly 3000 characters (excluding CR and LF)?

Why not just read each line and then take only the first 3000 characters, using string.Substring()? This way you don't have to worry about exactly how the string is terminated.

i.e.

    using (StreamReader sr = new StreamReader("TestFile.txt"))
    {
        String line;
        while ((line = sr.ReadLine()) != null)
        {
            // string data = line.Substring(0, 3000);
            // edit: if data is sometimes < 3000 ....
            string data = line.Substring(0, line.Length < 3000 ? line.Length : 3000);
            // do something with data
        }
    }
Ian G
  • 29,468
  • 21
  • 78
  • 92
I had previously been using a method similar to this. The line should always be 3000 characters excluding the CR and LF. However, it is not always, and that's why I am needing to check the length: I am getting files from a lot of different sources that are not always the correct length. If the length is less than 3000 characters and you substring it, it would fail and throw an exception. – buzzzzjay Sep 01 '11 at 22:18
  • 2
    @buzzz Don't 'think' something is slow, measure. Your chief cost will be I/O, not char/string processing. I would use a `while(...) { int ch = s.Read(); ... }` – H H Sep 01 '11 at 22:36
well you could use `line.Length < 3000 ? line.Length : 3000` – Ian G Sep 02 '11 at 08:28
1

Can you use the overload of StreamReader.Read (inherited from TextReader) that accepts 3 parameters: a char buffer (in your case a 3002-character array), a starting index into the buffer, and the number of characters to read (3002)? Reading in a loop, one record per call, you can then check the last two characters of the buffer for your CR and LF condition.

Arun
  • 2,493
  • 15
  • 12
I could, and I am currently trying this as a possibility. However, it was extremely inefficient in tests I have already run with files that contain 100,000s of records. – buzzzzjay Sep 01 '11 at 22:21
An alternative would be to use TWO StreamReaders: one where you would use ReadLine to read the line, and the other to Read just the last TWO characters into a char[2] buffer. Any time the last two fail to have CR+LF, you know the line has a problem. This way, you don't end up filling a 3002-character array repeatedly in a loop. – Arun Sep 01 '11 at 22:44
1

I believe you will find this version to be efficient:

    static bool CheckFile(string filename)
    {
        const int BUFFER_SIZE = 3002;

        using (var reader = new StreamReader(filename, Encoding.ASCII, false, BUFFER_SIZE))
        {
            var buffer = new char[BUFFER_SIZE];
            int charsRead;

            // The index argument is an offset into the buffer, not into the
            // file, so each record is read into the buffer starting at 0.
            while ((charsRead = reader.Read(buffer, 0, BUFFER_SIZE)) > 0)
            {
                // Read() may return fewer characters than requested,
                // so top the buffer up before validating.
                while (charsRead < BUFFER_SIZE)
                {
                    int more = reader.Read(buffer, charsRead, BUFFER_SIZE - charsRead);
                    if (more == 0) break;
                    charsRead += more;
                }

                if (charsRead != BUFFER_SIZE
                    || buffer[BUFFER_SIZE - 2] != '\r'
                    || buffer[BUFFER_SIZE - 1] != '\n')
                {
                    // the file does not conform
                    return false;
                }
            }
        }

        return true;
    }

The reason I'm optimistic about this is that, according to the docs, efficiency is increased when the size of the StreamReader's internal buffer matches the size of the buffer used for reading. Caveat: this code has not been tested or timed.

Paul Keister
  • 12,851
  • 5
  • 46
  • 75
0

I think I have finally figured out the code to get exactly what I want. Thoughts? The main issue I was encountering was that my line length is not guaranteed to be correct. Otherwise the method mentioned by @Paul Keister would have worked great, and it did when I tested it. Thanks for the help!

int asciiValue = 0;

while (asciiValue != -1)
{
    Boolean endOfRow = false;
    Boolean endOfRowValid = true;

    string currentLine = "";

    while (endOfRow == false)
    {
        asciiValue = file.Read();

        if (asciiValue == 10 || asciiValue == 13)
        {
            int asciiValueTemp = file.Peek();

            if (asciiValue == 13 && asciiValueTemp == 10)
            {
                endOfRow = true;
                asciiValue = file.Read();
            }
            else
            {
                endOfRowValid = false;
                endOfRow = true;
            }
        }
        else if (asciiValue != -1)
            currentLine += char.ConvertFromUtf32(asciiValue);
        else
            endOfRow = true;
    }

    // here: check endOfRowValid and currentLine.Length (expected 3000),
    // plus any other per-line validation
}

Edit: I forgot to mention that this seems to be just as efficient as using ReadLine(). I was really afraid it wouldn't work as well. It appears I was wrong.
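For reference, the same loop can be wrapped into a reusable method, sketched below. It uses a StringBuilder instead of string concatenation (repeated `+=` on a 3000-character string is quadratic) and adds the 3000-character length check from the question; the method name, the `expectedLength` parameter, and the return convention are assumptions, not part of the original post.

```csharp
using System.IO;
using System.Text;

static class RowValidator
{
    // Validates that every row is exactly expectedLength characters and is
    // terminated by CR+LF. Returns the 1-based number of the first bad row,
    // or 0 if the whole file conforms. (Sketch: names are assumptions.)
    public static int FirstBadRow(TextReader file, int expectedLength)
    {
        int row = 0;
        var current = new StringBuilder(expectedLength);
        int ch = file.Read();
        while (ch != -1)
        {
            row++;
            current.Length = 0;
            bool crlfOk = false;

            // Accumulate characters until a terminator or end of file.
            while (ch != -1 && ch != '\r' && ch != '\n')
            {
                current.Append((char)ch);
                ch = file.Read();
            }

            // Only CR immediately followed by LF counts as a valid ending.
            if (ch == '\r' && file.Peek() == '\n')
            {
                file.Read();        // consume the LF
                crlfOk = true;
            }

            if (!crlfOk || current.Length != expectedLength)
                return row;

            ch = file.Read();       // first character of the next row
        }
        return 0;
    }
}
```

A bare LF, a lone CR, a truncated final row, or a wrong line length all report the offending row number, which is the "pre check" the question describes.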

buzzzzjay
  • 1,140
  • 6
  • 27
  • 54