
I am at the very beginning of learning C# and .NET and am currently working on a small project. The aim of the project is to combine two 200 MB CSV files into one. Essentially it's the same file with the same items, but in a different language. What I need to do is read a few columns from one file and add them to the other by matching the item ID from both files.

I have done the above (the program runs quite fast: 24 seconds with roughly 60 MB of RAM), but... the app uses only one thread to do this. What I would like is to split the program across two threads: one that matches items by ID and builds the new CSV-ready string (most of the logic; it returns the string), and a second one that picks up the string from the first and writes it to the local file while the first one starts working on the next line.

Is the above doable at all, and if so, could someone point me in the right direction?
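
For reference, the current single-threaded flow looks roughly like this (file names and column positions are simplified placeholders, not my real layout):

```csharp
using System.Collections.Generic;
using System.IO;

// Pass 1: keep only the columns needed from file 1, keyed by item ID.
var fromFile1 = new Dictionary<string, string>();
using (var reader = new StreamReader("file1.csv"))
{
    string? line;
    while ((line = reader.ReadLine()) != null)
    {
        var parts = line.Split(',');
        fromFile1[parts[0]] = string.Join(",", parts[1], parts[2], parts[3]);
    }
}

// Pass 2: read file 2 line by line, match on the ID, append the stored columns, write out.
using var secondReader = new StreamReader("file2.csv");
using var writer = new StreamWriter("combined.csv");
string? secondLine;
while ((secondLine = secondReader.ReadLine()) != null)
{
    var id = secondLine.Split(',')[0];
    if (fromFile1.TryGetValue(id, out var extraColumns))
        writer.WriteLine(secondLine + "," + extraColumns);
}
```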

  • Sounds like a good fit for the producer-consumer pattern. Please look at this answer: https://stackoverflow.com/a/42197839/4553518 – Alexander Goldabin Jan 14 '20 at 10:32
  • Check [ConcurrentQueue](https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.concurrentqueue-1) to collect strings from one thread (`Enqueue` method) and to be the source for the file-writing thread (`TryDequeue` method called in a loop); a sketch of this is shown after these comments. More than one matching thread can be used, but then you have to split the second CSV file into several parts. You can use [CancellationTokenSource.Token](https://learn.microsoft.com/en-us/dotnet/api/system.threading.cancellationtokensource) to send a `stop` signal to the writing thread. – oleksa Jan 14 '20 at 10:35
  • Often when you're using the file system the use of threads actually slows down your code. – Enigmativity Jan 14 '20 at 11:17
  • @Enigmativity One thread should be used to write the file to avoid performance degradation. However, multiple threads can be used to do some calculation (like joining the CSV files). – oleksa Jan 14 '20 at 11:43
  • @oleksa - That makes no sense. – Enigmativity Jan 14 '20 at 12:22
  • @Enigmativity Probably; however, I've seen performance improvements using one thread to write the file and several to create the entries to be written, compared with the case where every thread tries to lock, open, write to, and close the file. – oleksa Jan 14 '20 at 12:27
  • Do you have to read all lines of both CSVs into memory before you start combining them into the resulting CSV file? – Theodor Zoulias Jan 14 '20 at 12:30
  • @oleksa - Yes, if you have multiple threads simultaneously reading and writing files things can slow down. All I'm saying is that it is best to restrict file access to a single file at a time and that often takes far more time than any CPU-based processing. There is often very little benefit or worse a lot of degradation when using threads in reading and writing files. – Enigmativity Jan 14 '20 at 20:33
  • @TheodorZoulias No, I read the first file line by line and only store what I need from it: item ID, param a, param b, param c. Then I close the StreamReader and open a new one for the second file. While reading that file line by line, I analyze each line, match the item IDs and combine the line with the data stored from the first file; a new line is then formed and passed to a new file via StreamWriter. What I would like to achieve is: – Michał Poterek Jan 16 '20 at 11:59
  • (Time ran out on the previous comment.) Thread 1: read the line from the file, combine it with what is stored in memory (making a new, longer line) and pass it to Thread 2. Thread 2: pick up the line from Thread 1 and append it to the new file while Thread 1 is already working on the next line from the first CSV file. I am currently trying out the producer-consumer pattern from @AlexanderGoldabin, will post any changes. – Michał Poterek Jan 16 '20 at 12:10
  • A tool well suited for this kind of job is the [TPL Dataflow](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library) library. But it has some learning curve, so I wouldn't suggest it to a beginner to C#. Even using this tool, I wouldn't expect much improvement in terms of performance, because the job is not easily parallelizable. – Theodor Zoulias Jan 16 '20 at 13:59
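
A minimal sketch of the `ConcurrentQueue` / `CancellationTokenSource` hand-off suggested in the comments above (file names are placeholders and the actual ID-matching logic is elided):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

var queue = new ConcurrentQueue<string>();
using var cts = new CancellationTokenSource();

// Writing thread: keep draining the queue; stop only once the producer has
// signalled "done" and nothing is left to dequeue.
var writerThread = new Thread(() =>
{
    using var writer = new StreamWriter("combined.csv");
    while (true)
    {
        if (queue.TryDequeue(out var line))
            writer.WriteLine(line);
        else if (cts.Token.IsCancellationRequested)
            break;           // producer finished and the queue is drained
        else
            Thread.Sleep(1); // nothing queued yet, back off briefly
    }
});
writerThread.Start();

// Matching thread (the main thread here): read the second file, build the combined line, enqueue it.
using (var reader = new StreamReader("file2.csv"))
{
    string? sourceLine;
    while ((sourceLine = reader.ReadLine()) != null)
    {
        // ... match the ID against the data kept from file 1 and build the longer line ...
        queue.Enqueue(sourceLine /* + "," + extraColumns */);
    }
}

cts.Cancel();        // signal the writer that no more lines are coming
writerThread.Join();
```

Only the writing thread touches the output file, which matches the point made above about keeping file access on a single thread.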

1 Answer


The solution was to use async/await: read file 1 asynchronously and work on the second part in the meantime. However, it did not lead to any performance gains.
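
A rough illustration of what that can look like; this is a reconstruction for illustration only (not the original project code), and the file names, column positions and lookup shape are placeholders:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Kick off reading file 1 in the background...
var lookupTask = LoadLookupAsync("file1.csv");

// ...and open the second file and the output while that runs.
using var reader = new StreamReader("file2.csv");
using var writer = new StreamWriter("combined.csv");

var lookup = await lookupTask;

string? line;
Task pendingWrite = Task.CompletedTask;
while ((line = await reader.ReadLineAsync()) != null)
{
    var id = line.Split(',')[0];
    if (!lookup.TryGetValue(id, out var extra))
        continue;

    await pendingWrite;                                        // make sure the previous write finished
    pendingWrite = writer.WriteLineAsync(line + "," + extra);  // start this write, then read the next line
}
await pendingWrite;

// Placeholder: parse file 1 into "item ID -> the columns needed from it".
static async Task<Dictionary<string, string>> LoadLookupAsync(string path)
{
    var map = new Dictionary<string, string>();
    using var fileReader = new StreamReader(path);
    string? row;
    while ((row = await fileReader.ReadLineAsync()) != null)
    {
        var parts = row.Split(',');
        map[parts[0]] = string.Join(",", parts[1], parts[2], parts[3]);
    }
    return map;
}
```

The only overlap here is between writing one line and reading or building the next, and since the work is dominated by disk I/O, this is consistent with seeing no measurable gain.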

  • It's often the case that multithreading or parallelism won't lead to the same gains you're expecting, unless you're explicitly dealing with long operations that outweigh the overhead of managing multiple threads. – Jeremy Caney May 24 '21 at 21:19
  • Would you mind updating your answer to include the code you ended up scaffolding to evaluate this? That will better aid future readers with similar questions. – Jeremy Caney May 24 '21 at 21:20