
I am reading a very wide file with Delphi.

The file is comma-delimited, and most of the time is spent parsing strings.

The logic is as follows:

  1. Open the file.
  2. Read a line.
  3. Split the line into an array of records.
  4. Pass the split array to the next procedure.
  5. Go to step 2.
  6. Close the file.
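
For reference, a minimal sketch of that sequential shape (ProcessRecord is a hypothetical stand-in for the next procedure in step 4; SplitString does the comma split):

uses
  System.SysUtils, System.StrUtils, System.Types, System.Classes;

procedure ReadSequentially(const FileName: string);
var
  reader: TStreamReader;
  fields: TStringDynArray;
begin
  reader := TStreamReader.Create(FileName);        { step 1: open file }
  try
    while not reader.EndOfStream do begin          { steps 2 and 5: read lines in a loop }
      fields := SplitString(reader.ReadLine, ','); { step 3: the expensive part }
      ProcessRecord(fields);                       { step 4: hypothetical next procedure }
    end;
  finally
    reader.Free;                                   { step 6: close file }
  end;
end;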

I want to run step 3 in parallel and am currently looking at OmniThreadLibrary.

What would be the best approach?

Shall I use Parallel For, a Pipeline, or a Queue?

I am thinking about using Parallel For, but the problem is that I do not know how many lines the file has.

  • A couple of questions you should ask yourself: 1) Can your program handle the lines being processed "out of order" from the file? 2) Does your processing of the split array involve any UI updating, or is it purely data update? 3) What makes you think it will be better/faster to start running multi-threaded? Remember, you will add a lot of complexity on top of your program, and if the speed gain doesn't pay off, you will end up with a difficult-to-maintain program that won't be much (if at all) better than a single-threaded version... – HeartWare Mar 17 '14 at 13:34
  • 1. Out of order is fine 2. Just data update 3. It takes 1 minute to read the file line by line without parsing and 20 minutes with parsing (the file is very wide) – user3428876 Mar 17 '14 at 13:46

3 Answers


There's nothing to be gained from using multiple threads to read the file. That part of the procedure is I/O bound rather than CPU bound, so you are best off reading the entire file from a single thread.

You then need to split the file into lines. That is again hard to do in parallel because of a dependency: line N+1 starts where line N ends. It will be simplest to do the splitting into lines in a single thread.

But you can run a pipeline between the I/O and the splitting into lines. Read the file in large chunks (say, tens of KB at a time) and pass each chunk down the pipeline to be processed into lines. You might need to place an upper bound on how much data is allowed to sit in the pipeline at any one moment; otherwise you might exhaust memory if the file can be read more quickly than it can be processed.

So for this pipeline, you have a producer that reads the file, and a consumer that splits the contents of the file into lines.

Then you can run another pipeline. At the producer end you have the list of lines produced by the previous step. That's pushed down the pipeline to the consumer, which processes each line. The consumer will do that with parallel for.
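
A rough, untested sketch of how that could look with OmniThreadLibrary's Parallel.Pipeline. For brevity, the chunk-reading and line-splitting work is folded into a single producer stage that emits whole lines; ParseLine is a hypothetical record-splitting routine, and the Throttle/NumTasks placement follows my reading of the OTL documentation:

uses
  Winapi.Windows, System.Classes, System.SysUtils,
  OtlCommon, OtlCollections, OtlParallel;

procedure ProcessFile(const FileName: string);
var
  pipeline: IOmniPipeline;
begin
  pipeline := Parallel.Pipeline
    { producer stage: a single task reads the file and emits complete lines }
    .Stage(
      procedure(const input, output: IOmniBlockingCollection)
      var
        reader: TStreamReader;
      begin
        reader := TStreamReader.Create(FileName, TEncoding.UTF8, False, 1 shl 20);
        try
          while not reader.EndOfStream do
            output.Add(reader.ReadLine); { blocks when the throttle limit is reached }
        finally
          reader.Free;
        end;
      end)
    .Throttle(10000) { upper bound on lines sitting in the pipeline at once }
    { consumer stage: several tasks split lines into records in parallel }
    .Stage(
      procedure(const input, output: IOmniBlockingCollection)
      var
        line: TOmniValue;
      begin
        while input.Take(line) do
          ParseLine(line.AsString); { hypothetical: split on commas, build records }
      end)
    .NumTasks(Environment.Process.Affinity.Count)
    .Run;
  pipeline.Input.CompleteAdding; { the producer stage generates its own data }
  pipeline.WaitFor(INFINITE);
end;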

David Heffernan
  • Just to clarify: the file is comma-delimited, and reading the file in a single thread is very fast. The problem is that splitting the comma-delimited lines takes a lot of time, so in theory doing this splitting in parallel should speed it up – user3428876 Mar 17 '14 at 13:53
  • @user3428876 Yes, that will be CPU and memory bound so parallel for should work well. It might also pay dividends to optimise the split code. – David Heffernan Mar 17 '14 at 14:03
  • My problem is that I do not know the file size in lines and I am not able to find an example for my situation: PrimeCount.Value := 0; Parallel.ForEach(1, 1000000).Execute( procedure (const value: integer) begin if IsPrime(value) then PrimeCount.Increment; end); – user3428876 Mar 17 '14 at 14:07
  • It would not be enough to know how many lines there are. You have to also know where each line starts and finishes. But that's the point of the first parts of my answer. Of course, I'm assuming that you don't want to read the entire file into memory. If you want to do that then you just read it in and use a parallel for. You still have not told us how large the file is. – David Heffernan Mar 17 '14 at 14:17
  • The file is about 20 gigs, and I do not want to read the entire file into memory. The part which splits the file into individual lines works fine, but the part which splits a line into an array of records is very slow. So my question is how to do a parallel for when the number of lines is unknown – user3428876 Mar 17 '14 at 14:23
  • The bulk of my answer covers how to do that. – David Heffernan Mar 17 '14 at 14:30
  • A simple approach would be: read the lines and, as each line is read, throw it to a worker (thread) that parses it and does something with the output. But limit the maximum number of workers to the number of cores. Surplus lines are then put into the queue of each worker. – Runner Mar 17 '14 at 17:59
  • @Runner OTL will deal with most of those details for you. – David Heffernan Mar 17 '14 at 18:01
  • @David I know, I just wanted the OP to understand the principle behind it. I got the impression that he does not understand how the problem should be approached. But I could be wrong. – Runner Mar 17 '14 at 18:14

Splitting the parsing up into chunks of, say, 10,000 lines may be an option. I do not know the OmniThread Library, so you must do the <Do Parallel For on ARR> part yourself, but the basic structure of the code goes something like this:

CONST ChunkSize = 10000;

VAR ARR : ARRAY[1..ChunkSize] OF STRING;
VAR Lines : Cardinal;
VAR TXT : TextFile;
VAR FileName : STRING;

Lines:=0;
AssignFile(TXT,FileName); RESET(TXT);
WHILE NOT EOF(TXT) DO BEGIN
  IF Lines=ChunkSize THEN BEGIN
    { buffer full - process the whole chunk in parallel }
    <Do Parallel For on ARR>;
    Lines:=0
  END;
  INC(Lines);
  READLN(TXT,ARR[Lines])  { read from TXT, not from standard input }
END;
CloseFile(TXT);
{ process the remaining partial chunk }
<Do Parallel For on ARR - only "Lines" lines>

Note that the code assumes the <Do Parallel For on ARR> part only continues once all entries in the array have been processed.
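
With OmniThreadLibrary, the placeholder could be filled by a Parallel.ForEach over the index range, along the lines of the snippet quoted in the comments above (a sketch only; ParseLine is a hypothetical parsing routine). Execute blocks by default until the loop has finished, which matches that assumption:

uses
  OtlParallel;

{ possible body for <Do Parallel For on ARR - only "Lines" lines> }
Parallel.ForEach(1, Lines).Execute(
  procedure(const i: integer)
  begin
    ParseLine(ARR[i]); { hypothetical: split ARR[i] into a record }
  end);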

HeartWare

You don't need to know the overall number of lines to use Parallel For, because you can iterate over a Blocking Collection. Just don't forget to call CompleteAdding when you have added the last line.
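
A minimal sketch of that pattern, assuming OTL's Parallel.ForEach overload that takes an IOmniBlockingCollection (untested; ParseLine is a hypothetical parsing routine):

uses
  System.Classes, OtlCommon, OtlCollections, OtlParallel;

procedure ParseFileParallel(const FileName: string);
var
  lines: IOmniBlockingCollection;
begin
  lines := TOmniBlockingCollection.Create;
  { producer: a background task reads the file and feeds the collection }
  Parallel.Async(
    procedure
    var
      reader: TStreamReader;
    begin
      reader := TStreamReader.Create(FileName);
      try
        while not reader.EndOfStream do
          lines.Add(reader.ReadLine);
      finally
        reader.Free;
      end;
      lines.CompleteAdding; { essential - lets the ForEach below terminate }
    end);
  { consumers: iterate the collection in parallel; Execute blocks until }
  { CompleteAdding has been called and every queued line is processed }
  Parallel.ForEach(lines).Execute(
    procedure(const value: TOmniValue)
    begin
      ParseLine(value.AsString); { hypothetical parsing routine }
    end);
end;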

Be aware that the performance of Parallel For may degrade heavily when each single task needs only a small amount of time compared to the thread and queue management overhead.

You might also consider using the BackgroundWorker abstraction and scheduling multiple lines in each work item, as sketched below.
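
A hedged sketch of that idea (untested; BatchSize, the NumTasks value and ParseLine are illustrative assumptions, and the IOmniBackgroundWorker calls follow my reading of the OTL documentation):

uses
  Winapi.Windows, System.Classes, OtlCommon, OtlParallel;

procedure ParseFileInBatches(const FileName: string);
const
  BatchSize = 1000; { arbitrary; tune so each work item does enough work }
var
  worker: IOmniBackgroundWorker;
  reader: TStreamReader;
  batch: TStringList;
begin
  worker := Parallel.BackgroundWorker
    .NumTasks(4) { arbitrary; e.g. the number of cores }
    .Execute(
      procedure(const workItem: IOmniWorkItem)
      var
        sl: TStringList;
        i: Integer;
      begin
        sl := workItem.Data.AsObject as TStringList;
        try
          for i := 0 to sl.Count - 1 do
            ParseLine(sl[i]); { hypothetical parsing routine }
        finally
          sl.Free; { the work item owns its batch }
        end;
      end);
  try
    reader := TStreamReader.Create(FileName);
    try
      batch := TStringList.Create;
      while not reader.EndOfStream do begin
        batch.Add(reader.ReadLine);
        if batch.Count = BatchSize then begin
          worker.Schedule(worker.CreateWorkItem(batch)); { hand the batch over }
          batch := TStringList.Create;
        end;
      end;
      if batch.Count > 0 then
        worker.Schedule(worker.CreateWorkItem(batch))
      else
        batch.Free;
    finally
      reader.Free;
    end;
  finally
    worker.Terminate(INFINITE); { wait until all scheduled batches are done }
  end;
end;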

Uwe Raabe