
I have been challenged with producing a method that will read very large text files into a program; these files can range from 2 GB to 100 GB.

The idea so far has been to read, say, a few thousand lines of text at a time into the method.

At the moment the program is set up with a StreamReader reading the file line by line and processing the necessary areas of data found on each line.

// xmldata, order, rec, d, IDs, NewPack and DataList are declared earlier
// in the program (DataList is a dictionary; see the comments below).
using (StreamReader reader = new StreamReader("FileName"))
{
    string nextline = reader.ReadLine();
    string textline = null;

    while (nextline != null)
    {
        textline = nextline;
        string IDD = textline.Substring(0, 3).TrimEnd();

        // Look up the field layout for this record type in the XML definition.
        var property = from matchID in xmldata
                       from matching in matchID.MyProperty
                       where matchID.ID == IDD
                       select matching;

        Row rw = new Row();
        rw.ID = IDD;

        foreach (var field in property)
        {
            Field fl = new Field();

            fl.Name = field.name;
            fl.Data = textline.Substring(field.startByte - 1, field.length).TrimEnd();
            fl.Order = order;
            fl.Show = true;

            order++;

            rw.AddField(fl);
        }
        rec.Rows.Add(rw);

        // Peek at the next line: end of file or a new pack header means
        // the current record is complete and can be flushed.
        nextline = reader.ReadLine();

        if ((nextline == null) || (NewPack == nextline.Substring(0, 3).TrimEnd()))
        {
            d.ID = IDs.ToString();
            d.Records.Add(rec);
            IDs++;
            DataList.Add(d.ID, d);
            rec = new Record();

            d = new Data();
        }
    }
}

The program goes on to populate a class. (I originally decided not to post the rest.)

I know that once the program is given an extremely large file, out-of-memory exceptions will occur.

So that is my current problem. So far I have been googling several approaches, with many people just answering "use a StreamReader and reader.ReadToEnd"; I know ReadToEnd won't work for me, as I will get those memory errors.

Finally, I have been looking into async as a way of creating a method that reads a certain number of lines and waits for a call before processing the next batch.

This brings me to my problem: I am struggling to understand async, and I can't seem to find any material that will help me learn it. I was hoping someone here could help me out with a way to understand async.

Of course, if anyone knows of a better way to solve this problem, I am all ears.

EDIT: Added the remainder of the code to put an end to any confusion.

Ruben Bartelink
user2169674
    I don't see any need for asynchrony here. It seems like you need to stream the data, but can process it entirely synchronously. – Servy Apr 11 '13 at 16:28
  • 1
    Where do the memory exceptions occur (what line)? How are you reading the nextline? What is xmldata? How long is each line in the file? – Polyfun Apr 11 '13 at 16:30
  • If you're using .NET 4.0, the Task library is a great way to manage async threads. A BlockingCollection is a nice way to handle a producer/consumer situation, which you could use in this case: read some data from the stream and add it to a queue, then have one or more threads processing that queue. – cgotberg Apr 11 '13 at 16:31
  • 2
    @cgotberg But using a producer consumer model would * increase* the memory footprint of the program, not decrease it. It would (if done properly) make the program faster, but that speed comes at the cost of more memory. – Servy Apr 11 '13 at 16:34
  • It would increase the memory footprint compared to synchronously doing small batches of read-then-process. If you set it up to pause reading when the queue got too big and wait for more items to be processed, then you could manage the memory so that an out-of-memory exception isn't thrown. – cgotberg Apr 11 '13 at 16:41
  • Another thing to watch out for is trying to retain data from a stream. You say you later "populate a class"; if that means you are saving things from a stream into memory, you are basically doing the same thing as `ReadToEnd`. Streams are designed to be used by acting on each element and then discarding it. If you need in-memory access to each element, you should be using a database, not a stream. – Dour High Arch Apr 11 '13 at 16:49
  • To simplify your reading, consider using `foreach (var line in File.ReadLines("filename"))`. Whether or not you run out of memory in your program will depend on how much information you're storing for each line that you read. – Jim Mischel Apr 11 '13 at 16:57
  • Well, it seems I may have been confused; I'll look into synchronously doing small batches. I'll then be able to refine the issues I am having with this method. – user2169674 Apr 12 '13 at 08:03
  • I have added the rest of the code to avoid confusion, and I just realised some of the variables won't be known. The information gets added to DataList, which is a dictionary. @DourHighArch – user2169674 Apr 12 '13 at 08:18

1 Answer


Your problem isn't synchronous vs. asynchronous; it's that you're reading the entire file and storing parts of it in memory before you do anything with that data.

If you read each line, process it, and write the result to another file or database, then StreamReader will let you process multi-GB (or TB) files.

There's only a problem if you're storing portions of the file until you finish reading it; then you can run into memory issues (though you'd be surprised how large you can let Lists and Dictionaries get before you run out of memory).

What you need to do is save your processed data as soon as you can, and not keep it in memory (or keep as little in memory as possible).
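To make that concrete, here is a minimal sketch of the read-process-write pattern (the file names and the ID extraction are placeholders, not the asker's actual parsing logic):

```csharp
// Sketch: stream in, process, stream out. Nothing accumulates in memory,
// so the input file can be arbitrarily large.
using (var reader = new StreamReader("input.txt"))
using (var writer = new StreamWriter("output.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Process the line immediately (placeholder logic)...
        string id = line.Length >= 3 ? line.Substring(0, 3).TrimEnd() : line;

        // ...and write the result straight out, rather than adding it
        // to a List or Dictionary that grows with the file.
        writer.WriteLine(id);
    }
}
```

The key point is that memory usage stays constant regardless of file size, because each line is discarded as soon as its result is written.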

With files that large you may need to keep your working set (your processing data) in a database; something like SQL Server Express or SQLite would do (but again, it depends on how large your working set gets).
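As a rough illustration of spilling the working set to SQLite instead of a Dictionary (this assumes the Microsoft.Data.Sqlite package; the table and column names are made up for the example):

```csharp
using Microsoft.Data.Sqlite;

// Sketch: each processed row goes to disk, not to RAM.
using (var conn = new SqliteConnection("Data Source=workingset.db"))
{
    conn.Open();

    var create = conn.CreateCommand();
    create.CommandText = "CREATE TABLE IF NOT EXISTS rows (id TEXT, data TEXT)";
    create.ExecuteNonQuery();

    var insert = conn.CreateCommand();
    insert.CommandText = "INSERT INTO rows (id, data) VALUES ($id, $data)";
    var pId = insert.Parameters.Add("$id", SqliteType.Text);
    var pData = insert.Parameters.Add("$data", SqliteType.Text);

    using (var reader = new StreamReader("input.txt"))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            pId.Value = line.Substring(0, 3).TrimEnd();
            pData.Value = line;
            insert.ExecuteNonQuery();
        }
    }
}
```

For bulk loading like this you would normally wrap the inserts in a transaction (`conn.BeginTransaction()`), which makes them dramatically faster.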

Hope this helps. Don't hesitate to ask further questions in the comments or edit your original question; I'll update this answer if I can help in any way.

Update - Paging/Chunking

You need to read the text file in chunks of one page, and allow the user to scroll through the "pages" in the file. As the user scrolls you read and present them with the next page.

Now, there are a couple of things you can do to help yourself. Always keep about 10 pages in memory; this keeps your app responsive if the user pages up or down a couple of pages very quickly. In the application's idle time (the Application.Idle event) you can read in the next few pages, again throwing away pages that are more than five pages before or after the current page.

Paging backwards is a problem, because you don't know where each line begins or ends in the file, and therefore you don't know where each page begins or ends. So for paging backwards, as you read down through the file, keep a list of offsets to the start of each page (Stream.Position); then you can quickly Seek to a given position and read the page in from there.
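A sketch of that offset-tracking approach (page size and file name are arbitrary; the byte accounting assumes the file's newline matches Environment.NewLine and there is no BOM, so treat it as illustrative rather than exact):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class Pager
{
    const int LinesPerPage = 1000;
    readonly List<long> pageOffsets = new List<long> { 0 }; // page 0 starts at byte 0
    readonly string path;

    public Pager(string path)
    {
        this.path = path;

        // First pass: note the byte offset where each page starts.
        // (StreamReader buffers, so fs.Position can't be read directly
        // while reading; summing line byte lengths is one workaround.)
        using (var reader = new StreamReader(path))
        {
            long offset = 0;
            int linesOnPage = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                offset += reader.CurrentEncoding.GetByteCount(line)
                          + Environment.NewLine.Length;
                if (++linesOnPage == LinesPerPage)
                {
                    pageOffsets.Add(offset);
                    linesOnPage = 0;
                }
            }
        }
    }

    // Jump straight to page n and read just that page.
    public List<string> ReadPage(int n)
    {
        var lines = new List<string>();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(pageOffsets[n], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
            {
                string line;
                while (lines.Count < LinesPerPage
                       && (line = reader.ReadLine()) != null)
                    lines.Add(line);
            }
        }
        return lines;
    }
}
```

With the offsets recorded, paging in either direction is just an index into `pageOffsets` followed by a Seek, so there is no need to re-read the file from the start.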

If you need to allow the user to search through the file, then you pretty much read through the file line by line (remembering the page offsets as you go) looking for the text; when you find something, read in and present them with that page.

You can speed everything up by pre-processing the file into a database. There are grid controls that will work off a dynamic dataset (they will do the paging for you), and you get the benefit of built-in searches and filters.

So, from a certain point of view, this is reading the file asynchronously, but only from the user's point of view. From a technical point of view, we tend to mean something else when we talk about doing something asynchronously in programming.

Binary Worrier
  • Hmm, I thought I was processing the data after each line by saving it to a dictionary, but I understand that I shall get memory issues as I read the whole document line by line, adding to the dictionary (eventually I reckon I'll hit the memory cap). The reason I came up with the idea of trying async was to wait for, say, the user to click a down arrow on the UI before processing the next bunch of lines. – user2169674 Apr 12 '13 at 09:22
  • 1
    Dude, I'd need to know more about the problem you're trying to solve. Do you "process" the data then save it, or is this to display information to a user? If you're basically giving a user a "view" of the file then you need to read the file in chunks (or pages) as the user wants to view them. Can you tell me what you need to do, then I can update my answer (p.s. I've seen the code you've added to your question, that doesn't tell me what you **need** to do). – Binary Worrier Apr 12 '13 at 10:17
  • My task is to develop a system that takes in giant text files and provides a more readable view of the file by taking specific information and displaying it on screen. At the moment my program goes through each line of the file until the end, taking the needed data and saving each needed part to a dictionary. Hope this clears it up. – user2169674 Apr 12 '13 at 10:27