
I have a very large text file, around 1 GB.

I need to count the number of words and characters (non-space characters).

I have written the code below.

string fileName = "abc.txt";
long words = 0;
long characters = 0;
if (File.Exists(fileName))
{
    using (StreamReader sr = new StreamReader(fileName))
    {
        string[] fields = null;
        string text = sr.ReadToEnd();
        fields = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        foreach (string str in fields)
        {
            characters += str.Length;
        }
        words += fields.LongLength;
    }

    Console.WriteLine("The word count is {0} and character count is {1}", words, characters);
}

Is there any way to make it faster using threads? Someone has suggested that I use threads so that it will be faster.

I have found one issue in my code: it will fail if the number of words or characters is greater than the `long` max value.

I have written this code assuming that there will be only English characters, but there can be non-English characters as well.

I am especially looking for thread-related suggestions.

Vivek Nuna
  • 2
    @OlivierRogier yes, I have tried reading line by line also. But assume there is only a single large line in the whole file – Vivek Nuna Jan 16 '21 at 11:31
  • 6
    If there's only one line, the best you can do is read chunks of bytes of appropriate length and process them one at a time. – Camilo Terevinto Jan 16 '21 at 11:35
  • 3
    "Edit: I am especially looking for the thread related suggestions" -> *no*, you are not. Multiple threads reading the same file will only cause concurrency issues and either throw exceptions or read invalid data. Don't trust whatever people tell you, learn first. – Camilo Terevinto Jan 16 '21 at 11:38
  • @CamiloTerevinto you are absolutely right, but is there a way to improve performance using threads? – Vivek Nuna Jan 16 '21 at 11:39
  • @CamiloTerevinto Thank you for your suggestions. Is there any way we can get rid of the concurrency issues? – Vivek Nuna Jan 16 '21 at 11:41
  • 1
    The *only* way to use multiple threads here would be by splitting the file into enough different files, which means reading and writing the data back to disk, so it wouldn't improve anything. You could read multiple chunks and process them concurrently in memory, but the added complexity will almost certainly outweigh any (likely small) performance benefit – Camilo Terevinto Jan 16 '21 at 11:41
  • @CamiloTerevinto how can we read chunks of bytes and process them concurrently? Sorry, I am new to this concept – Vivek Nuna Jan 16 '21 at 11:43
  • @OlivierRogier what is a multisection web downloader? – Vivek Nuna Jan 16 '21 at 11:44
  • 1
    That's *extremely* broad and would take hours to program. I'll give you the basics. You'd use `Read` to read chunks of bytes, get the appropriate characters from it and possibly send it to something like a TPL DataFlow block, [read here](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library) – Camilo Terevinto Jan 16 '21 at 11:45
  • 1
    @CamiloTerevinto Thank you for your time, I will try the solutions which you have proposed. – Vivek Nuna Jan 16 '21 at 11:47
  • @OlivierRogier it can be one line or multiple lines. so the solution should be generic, not specific to any particular case – Vivek Nuna Jan 16 '21 at 11:49
  • @CamiloTerevinto do you have any suggestions for non-English characters? because I want to handle as many cases as possible – Vivek Nuna Jan 16 '21 at 11:50
  • `Is there any way to make it faster using threads, someone has suggested me to use threads so that it will be faster?` How fast is it now? How fast does it need to be? – mjwills Jan 16 '21 at 11:54
  • @OlivierRogier thank you, I have learned a lot on this question from you. So do you think all these points can improve the performance of my program for the given case, where I need to count the number of words and characters? Do you think my code can be improved programmatically? – Vivek Nuna Jan 16 '21 at 11:56
  • Found that: [Multiprocess, multithreaded read write on a single file](https://codereview.stackexchange.com/questions/122918/multiprocess-multithreaded-read-write-on-a-single-file) –  Jan 16 '21 at 11:56
  • @mjwills that is what I don't know, because using threads can have a lot of side effects also. That is why I have asked this question; I may be wrong, so I wanted experts' suggestions – Vivek Nuna Jan 16 '21 at 11:57
  • @OlivierRogier What do 10 sections mean? 10 different files? And reading each file in a different thread? – Vivek Nuna Jan 16 '21 at 11:58
  • @OlivierRogier Is there a way to start reading from the 100MB position? I mean, read the file from one position to another position? – Vivek Nuna Jan 16 '21 at 12:01
  • 1
    @OlivierRogier thank you. haha, I agree with you, it is a broad question. I am fine with the algo/pseudocode also – Vivek Nuna Jan 16 '21 at 12:09
  • `that is what I don't know` You don't know how fast it is now? And you don't know what your performance target is? How will you know the job is done then? – mjwills Jan 16 '21 at 12:13
  • @mjwills I have no strict target but I’m looking for a solution which is faster than my answer – Vivek Nuna Jan 16 '21 at 12:15
  • If you really want to try `Parallel.ForEach`, maybe start with https://stackoverflow.com/a/50589508/34092 to avoid allocating a huge array. I (very much) doubt it will help, but it would be easy to code and test. And it will force you to actually test what the method performs like now, to see if the second attempt is faster or not. ;) – mjwills Jan 16 '21 at 12:40
  • Hi @OlivierRogier. What's up with this "having only one line" phrase that comes and goes from the title and the body of the question? Honestly I am confused whether your edits preserve the goals of the post's owner, or conflict with their intent. – Theodor Zoulias Jan 19 '21 at 12:08
  • @OlivierRogier ha ha! I like it too. :-) I think that it would be appropriate to include in the question the caveat that the file can contain some very long lines, or even be composed by a single line, because this restricts the viable solutions to the problem. – Theodor Zoulias Jan 19 '21 at 12:23
  • @TheodorZoulias I am looking for a very generic solution, which covers most of the cases. Just imagine some of your clients have uploaded a huge file at some location and you are reading it, and you don't know what is inside the file – Vivek Nuna Jan 19 '21 at 12:30
  • @OlivierRogier **viV@&{%?|’ no&;₹;₹; name** are 3 words. And count the nonspace characters – Vivek Nuna Jan 19 '21 at 12:43
  • @OlivierRogier My focus is on 1GB file – Vivek Nuna Jan 19 '21 at 13:05
  • @viveknuna I think that you should make it more prominent that no assumptions should be made about the size and the number of the lines/words in the file. This information should be in the body of the question, not hidden in the comments. – Theodor Zoulias Jan 19 '21 at 13:22
  • @TheodorZoulias I agree with you. But if no specific requirement is given, then we should always try to cover all cases – Vivek Nuna Jan 19 '21 at 13:34
  • @viveknuna that's not exactly true. As programmers we are constantly making assumptions about the data we are working with. If you ask me to design a database that stores customer information, I'll assume that all customers are younger than 2,147,483,647 years, and their bank accounts hold less than 79,228,162,514,264,337,593,543,950,335 currency units (`decimal.MaxValue`). Similarly for a text file it's logical to assume that no line will be longer than, say, 100,000 characters, and with this assumption I could safely use the `File.ReadLines` method, without risking running out of memory. – Theodor Zoulias Jan 19 '21 at 13:55

3 Answers

8

Here is how you could tackle the problem of counting the non-whitespace characters of a huge text file efficiently, using parallelism. First, we need a way to read blocks of characters in a streaming fashion. The native `File.ReadLines` method doesn't cut it, since the file may well consist of a single line. Below is a method that uses the `StreamReader.ReadBlock` method to grab blocks of characters of a specific size, and returns them as an `IEnumerable<char[]>`.

public static IEnumerable<char[]> ReadCharBlocks(String path, int blockSize)
{
    using (var reader = new StreamReader(path))
    {
        while (true)
        {
            var block = new char[blockSize];
            var count = reader.ReadBlock(block, 0, block.Length);
            if (count == 0) break;
            if (count < block.Length) Array.Resize(ref block, count);
            yield return block;
        }
    }
}

With this method in place, it is then quite easy to parallelize the parsing of the characters blocks using PLINQ:

public static long GetNonWhiteSpaceCharsCount(string filePath)
{
    return Partitioner
        .Create(ReadCharBlocks(filePath, 10000), EnumerablePartitionerOptions.NoBuffering)
        .AsParallel()
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .Select(chars => chars
            .Where(c => !Char.IsWhiteSpace(c) && !Char.IsHighSurrogate(c))
            .LongCount())
        .Sum();
}

What happens above is that multiple threads are reading the file and processing the blocks, but reading the file is synchronized: only one thread at a time is allowed to fetch the next block, by calling the `IEnumerator<char[]>.MoveNext` method. This behavior does not resemble a pure producer-consumer setup, where one thread would be dedicated to reading the file, but in practice the performance characteristics should be the same, because this particular workload has low variability. Parsing each character block should take approximately the same time, so when a thread is done reading a block, another thread should already be waiting to read the next block, resulting in the combined reading operation being almost continuous.

The `Partitioner` configured with `NoBuffering` is used so that each thread acquires one block at a time. Without it, PLINQ utilizes chunk partitioning, which means that each thread progressively asks for more and more elements at a time. Chunk partitioning is not suitable in this case, because the mere act of enumerating is costly.

The worker threads are provided by the `ThreadPool`. The current thread also participates in the processing, so in the above example, assuming that the current thread is the application's main thread, the number of threads provided by the `ThreadPool` is `Environment.ProcessorCount - 1`.

You may need to fine-tune the operation by adjusting the `blockSize` (larger is better) and the degree of parallelism to the capabilities of your hardware. `Environment.ProcessorCount` may be too many; 2 could probably be enough.

The problem of counting the words is significantly more difficult, because a word may span more than one character block. It is even possible that the whole 1 GB file contains a single word. You may try to solve this problem by studying the source code of the `StreamReader.ReadLine` method, which has to deal with the same kind of problem. Tip: if one block ends with a non-whitespace character and the next block starts with a non-whitespace character as well, there is certainly a word split in half there. You could keep track of the number of split-in-half words, and eventually subtract this number from the total number of words.
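
For illustration, here is a minimal sequential sketch of that boundary-tracking idea (not part of the original answer, and not parallelized), reusing the `ReadCharBlocks` method above: a new word is counted only when a non-whitespace character follows whitespace (or the start of the file), so a word that straddles two blocks is counted exactly once.

public static long GetWordsCount(string filePath)
{
    long words = 0;
    // Treat the start of the file as if it were preceded by whitespace.
    bool previousWasWhiteSpace = true;
    foreach (char[] block in ReadCharBlocks(filePath, 10000))
    {
        foreach (char c in block)
        {
            bool isWhiteSpace = Char.IsWhiteSpace(c);
            // A word begins exactly where whitespace is followed by non-whitespace,
            // even if that transition happens across a block boundary.
            if (previousWasWhiteSpace && !isWhiteSpace) words++;
            previousWasWhiteSpace = isWhiteSpace;
        }
    }
    return words;
}

Parallelizing this counting would additionally require carrying the boundary state between blocks (or merging per-block results by their position), which is exactly the extra complexity described above.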

Theodor Zoulias
  • 1
    Would it be possible to return spans to avoid the array allocations? Also, I wonder how the above code will handle surrogate pairs (I don't know the answer). – mjwills Jan 16 '21 at 12:53
  • @mjwills probably. I am not familiar with `Span`s. They may have limitations that prevent them from being passed from method to method. I'll need to do some research. – Theodor Zoulias Jan 16 '21 at 12:55
  • @mjwills surrogate pairs are handled as expected: poorly! Each surrogate pair is counted as two characters. You could include a `Char.IsSurrogate` check in the PLINQ query, to count them off. – Theodor Zoulias Jan 16 '21 at 13:01
  • @mjwills changing the method's return type to `IEnumerable<Span<char>>` gives the following compile-time error: "CS0306 - The type 'Span<char>' may not be used as a type argument". Another idea for avoiding multiple `char[]` allocations would be to use an [`ArrayPool`](https://learn.microsoft.com/en-us/dotnet/api/system.buffers.arraypool-1), but this would require the collaboration of the consuming side, by returning the processed arrays to the pool. So it's an efficiency <-> complexity trade-off. – Theodor Zoulias Jan 16 '21 at 13:32
  • 1
    I experimented with an `ArrayPool`, and I discovered some difficulties. The pool can `Rent` arrays that are larger than the requested size, and throws exceptions when attempts are made to `Return` arrays that were not rented by the pool. What worked for me was to change the return type of the `ReadCharBlocks` method from `IEnumerable<char[]>` to `IEnumerable<ArraySegment<char>>`, and call `arrayPool.Return(segment.Array)` when I am done with a segment. – Theodor Zoulias Jan 19 '21 at 05:56
1

This is a problem that doesn't need multithreading at all! Why? Because the CPU is far faster than the disk I/O, so even in a single-threaded application the program will be waiting for data to be read from the disk. Using more threads will only mean more waiting. What you want is asynchronous file I/O. So, a design like this:

main
  asynchronously read a chunk of the file (one MB perhaps), calling the callback on completion
  while not at end of file
    wait for asynchronous read to complete
    process chunk of data
  end
end

asynchronous read completion callback
  flag data available to process
  asynchronously read next chunk of the file, calling the callback on completion
end
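
For illustration, here is one possible C# sketch of this design (not part of the original answer): it uses double buffering with `StreamReader.ReadBlockAsync`, so the next chunk is already being read while the current one is processed. The 1 MB buffer size and the non-whitespace counting are placeholders for whatever processing is actually needed.

using System;
using System.IO;
using System.Threading.Tasks;

public static class AsyncCounter
{
    public static async Task<long> CountNonWhiteSpaceCharsAsync(string path)
    {
        long count = 0;
        using (var reader = new StreamReader(path))
        {
            var current = new char[1 << 20]; // roughly 1 MB worth of characters
            var next = new char[1 << 20];

            int read = await reader.ReadBlockAsync(current, 0, current.Length);
            while (read > 0)
            {
                // Start reading the next chunk before processing the current one.
                Task<int> nextRead = reader.ReadBlockAsync(next, 0, next.Length);

                for (int i = 0; i < read; i++)
                    if (!char.IsWhiteSpace(current[i])) count++;

                read = await nextRead;
                // Swap buffers: the freshly read chunk becomes the current one.
                var temp = current; current = next; next = temp;
            }
        }
        return count;
    }
}

As the comments below point out, whether asynchronous I/O actually beats a plain synchronous loop depends on the hardware, so measure before committing to it.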
Skizz
  • 1
    This answer _may_ rely on the assumption that the async IO is always faster. That isn't universally true. https://stackoverflow.com/a/39356462/34092 – mjwills Jan 16 '21 at 12:16
  • Is there any way to call the method when the read completes? – Vivek Nuna Jan 16 '21 at 12:18
  • @mjwills: Yes, there is extra set up time for an async read, but it's a big file so that overhead should be negligible when compared to everything else that's going on. For small file, you're right, it would be an issue. – Skizz Jan 16 '21 at 12:19
  • There's no reason you couldn't process the chunk of data in the callback, just make sure you start the next read before processing the just read data. – Skizz Jan 16 '21 at 12:20
0

You may get the file's length at the beginning; let it be "S" (bytes). Then let's take some constant "C".

Execute C threads, and let each one of them process text of length S/C. You may read all of the file at once and load it into memory (if you have enough RAM for this), or you may let every thread read the relevant part of the file.

The first thread will process bytes 0 to S/C, the second thread bytes S/C to 2S/C, and so on.

After all threads have finished, sum up the counts.

How is that?
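
For illustration, a rough sketch of the first variant (reading the whole file once, then splitting the processing across C threads with `Parallel.For`) could look like the code below. This is not part of the original answer, and counting words this way would additionally need to handle words that straddle slice boundaries, as discussed in the first answer.

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class SliceCounter
{
    public static long CountNonSpaceChars(string path, int c = 4)
    {
        // Single read from disk; this assumes there is enough RAM for the whole file.
        string text = File.ReadAllText(path);
        long[] partial = new long[c];
        Parallel.For(0, c, i =>
        {
            // Each of the C threads counts the non-whitespace characters of its own slice.
            int start = (int)((long)text.Length * i / c);
            int end = (int)((long)text.Length * (i + 1) / c);
            long count = 0;
            for (int j = start; j < end; j++)
                if (!char.IsWhiteSpace(text[j])) count++;
            partial[i] = count;
        });
        return partial.Sum(); // summarize the counts
    }
}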

javadev
  • The idea of reading concurrently different parts of the file may be feasible with an SSD, but most probably will yield poor results with a classic hard disk. The head of the disk cannot be in N places at the same moment. – Theodor Zoulias Jan 23 '21 at 19:16
  • So as I said, you may read the whole file from a single thread, and just divide the processing between c threads. – javadev Jan 23 '21 at 19:18
  • Yes, that's your first suggestion. I am talking about the second suggestion: *"...let every thread to read the relevant part of the file"*. – Theodor Zoulias Jan 23 '21 at 19:27