
How can I split the workload of processing records across multiple threads? Specifically, an Accuracer DB that has 169 records, processed with 7 threads, for example.

I could just split the records into ranges and let each thread process one range, but if the user deletes or adds a record that approach will not work well.

user299323
  • Is your question "how to process it in threads?", or "how to divide the work so each thread has the same work to do?", or both? – jachguate Mar 10 '13 at 06:12
  • Are you saying that numberOfRecordsPerThread = totalRecords/7 does not work because new records might come in? – angelatlarge Mar 10 '13 at 06:12
  • @jachguate how to divide the work so each thread has the same workload :) Thanks for reminding me, I will edit. – user299323 Mar 10 '13 at 06:13
  • @angelatlarge what if the record count changes? The user deletes or adds a record, then what? – user299323 Mar 10 '13 at 06:15
  • 2
  • Well, after you assign the initial number of records to the threads, if you have more records to process you could randomly assign them to threads. But what *I* would do instead is have a dispatcher class, and have the threads request records as they run out of records to process. If records are processed quickly you could give threads batches of 2 or 3 or 5 or 10 records. If slowly, then dispense records 1 at a time (see the sketch after these comments). – angelatlarge Mar 10 '13 at 06:18
  • And I would do this for the initial records as well. – angelatlarge Mar 10 '13 at 06:18
  • @angelatlarge interesting concept.. – user299323 Mar 10 '13 at 06:20
  • You have to define workload. If a thread, for some reason, spends more time processing one record, and you assign the same number of records to each thread, you may end up with threads starving while others are very busy. – jachguate Mar 10 '13 at 06:21
  • @jachguate workload = looping through the records so that each thread gets a unique set of records to process – user299323 Mar 10 '13 at 06:24
  • 1
  • Well, your question is interesting, but very general, so it is hard to write a specific answer for it. My general advice is to look at [OmniThreadLibrary](http://otl.17slon.com/), which IMHO is the best option you have to write this without much hassle. If you have something already done, it's better to show what you have to get useful answers. – jachguate Mar 10 '13 at 06:34
  • @jachguate Yes, I use OmniThreadLibrary already. Right now I just split the records into ranges and tell each thread to loop from-to :) With CreateTask. So I want to know if there is another way of doing this. – user299323 Mar 10 '13 at 06:35
  • Which abstraction are you using? – jachguate Mar 10 '13 at 06:38
  • @jachguate CreateTask() – user299323 Mar 10 '13 at 06:38
  • Well, you have better options there. ForEach and Pipeline come to mind. As said before, without more detail it is hard to tell. Look at section 3.11.3 of the OTL book. – jachguate Mar 10 '13 at 06:43
  • @jachguate For looping through things ForEach and Pipeline are better? – user299323 Mar 10 '13 at 06:46
  • For what you want to do, you can do it with both. What is better in your situation, I can't tell with what I know for now. – jachguate Mar 10 '13 at 06:49
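A minimal sketch of the dispatcher idea from the comments above, for illustration only: a hypothetical TRecordDispatcher hands out record IDs in small batches to worker threads that ask for more work as they finish. None of these names come from Accuracer or OmniThreadLibrary.

uses
  System.SysUtils, System.Generics.Collections, System.SyncObjs;

type
  //hypothetical dispatcher: workers ask it for small batches of record IDs
  //instead of being assigned fixed ranges up front
  TRecordDispatcher = class
  private
    FQueue: TQueue<Integer>;   //pending record IDs
    FLock: TCriticalSection;
  public
    constructor Create;
    destructor Destroy; override;
    procedure AddRecord(AID: Integer);                  //call whenever a record is added
    function NextBatch(AMax: Integer): TArray<Integer>; //a worker asks for up to AMax IDs
  end;

constructor TRecordDispatcher.Create;
begin
  inherited Create;
  FQueue := TQueue<Integer>.Create;
  FLock := TCriticalSection.Create;
end;

destructor TRecordDispatcher.Destroy;
begin
  FLock.Free;
  FQueue.Free;
  inherited;
end;

procedure TRecordDispatcher.AddRecord(AID: Integer);
begin
  FLock.Enter;
  try
    FQueue.Enqueue(AID);
  finally
    FLock.Leave;
  end;
end;

function TRecordDispatcher.NextBatch(AMax: Integer): TArray<Integer>;
var
  I, Count: Integer;
begin
  FLock.Enter;
  try
    Count := FQueue.Count;
    if Count > AMax then
      Count := AMax;
    SetLength(Result, Count);
    for I := 0 to Count - 1 do
      Result[I] := FQueue.Dequeue;
  finally
    FLock.Leave;
  end;
end;

Each worker thread would loop on NextBatch, process whatever it gets, and sleep briefly (or exit) when the batch comes back empty, so a slow record only delays one worker while the others keep draining the queue.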

1 Answer


You can use OmniThreadLibrary to process records from a database in parallel without much hassle.

I wrote an example using the Pipeline abstraction. The pipeline consists of 3 stages:

  1. The first stage reads the data from the database and, for each record, creates an instance of a container object to represent that data for the next stage of the pipeline.
  2. The second stage processes the incoming data.

    • calls the DoSomethingWith procedure, which simply wastes around 100 ms to simulate processing the data
    • frees the memory of the container instance
    • adds the literal value 1 to the output queue to inform the final stage that another record has been processed

    This stage is configured to run in parallel in 7 threads.

  3. The last stage just counts how many records have been completed by the previous stage.

The example is a console application, so you can just copy/paste it and see it working live on your machine.

program Project1;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils,
  OtlCommon,
  OtlCollections,
  OtlParallel,
  System.Diagnostics,
  DB, DBClient;

type
  //Auxiliary container used to copy the database data and avoid
  //synchronization: remember that a TDataSet's "current record"
  //may cause conflicts if changed from different threads.
  TContainer = class
  private
    FName: string;
    FID: Int64;
  public
    property ID: Int64 read FID write FID;
    property Name: string read FName write FName;
  end;

//does nothing, but wastes around 100ms. "processing" each record
procedure DoSomethingWith(const AValue: TContainer);
begin
  Sleep(100);
end;

//creates a DataSet on the fly with a random number of records
function CreateDataSet: TClientDataSet;
var
  I: Integer;
begin
  Result := TClientDataSet.Create(nil);
  with Result.FieldDefs.AddFieldDef do
  begin
    Name := 'ID';
    DataType := ftLargeint;
  end;
  with Result.FieldDefs.AddFieldDef do
  begin
    Name := 'NAME';
    DataType := ftString;
  end;
  Result.CreateDataSet;
  for I := 1 to Random(1000) + 1 do //at least one record, avoids division by zero below
    Result.InsertRecord([I, 'Test']);
end;

var
  RecordsProcessed: Integer;
  SW: TStopwatch;
  Data: TDataSet;
begin
  IsMultiThread := True;
  Randomize;
  Writeln('wait while processing...');
  SW := TStopwatch.Create;
  SW.Start;
  try
    Data := CreateDataSet;
    try
      RecordsProcessed := Parallel.Pipeline
        .Stage(
          procedure (const Input, Output: IOmniBlockingCollection)
          var
            RecData: TContainer;
          begin
            Data.First;
            while not Data.Eof do
            begin
              RecData := TContainer.Create;
              RecData.ID := Data.Fields[0].AsLargeInt;
              RecData.Name := Data.Fields[1].AsString;
              Output.Add(RecData);
              Data.Next;
            end;
          end)
        .Stage(
          procedure (const Input: TOmniValue; var Output: TOmniValue)
          begin
            //process the real thing here
            DoSomethingWith(Input.AsObject as TContainer); //cast the queued value back to the container
            Input.AsObject.Free;
            Output := 1; //another record
          end)
        .NumTasks(7) //this stage is processed by 7 parallel tasks
        .Stage(
           procedure (const Input, Output: IOmniBlockingCollection)
           var
             Recs: Integer;
             Value: TOmniValue;
           begin
             Recs := 0;
             for Value in Input do
               Inc(Recs, Value);
             Output.Add(Recs);
           end)
        .Run.Output.Next;
      SW.Stop;
      Writeln(RecordsProcessed, ' records processed in ', SW.ElapsedMilliseconds, 'ms.');
      Writeln('Avg. ', (SW.ElapsedMilliseconds/RecordsProcessed):0:3, 'ms./record');
    finally
      Data.Free;
    end;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  readln;
end.

The main advantages of doing it this way, IMHO, are:

  • You have a flexible mechanism to distribute the job among the multiple workers. If some record takes more time to process, the library takes care of the situation and you can reasonably expect to finish the total work in the least possible time.
  • Your first processing thread starts as soon as you finish reading the first record from the database.
  • You can easily adapt it if you have to wait for more incoming records in the base table (a sketch follows after this list). The output queue of a stage will not be marked as finished until the code in the stage procedure ends, so if at some point there's no more work to do, the downstream stages will just block, waiting for more data to process.
  • You change the number of worker threads just by changing a parameter value!
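To illustrate the point about waiting for incoming records, the first stage could be replaced by something like the hedged variation below. ShouldStop and FetchNewRecords are hypothetical placeholders for your own stop signal and refresh logic; they are not part of the example above or of OmniThreadLibrary.

        //variation of the first stage only: keep polling the dataset for new
        //rows instead of ending at Eof; the stage's output queue is marked
        //complete only when this anonymous method returns
        .Stage(
          procedure (const Input, Output: IOmniBlockingCollection)
          var
            RecData: TContainer;
          begin
            Data.First;
            while not ShouldStop do            //hypothetical stop flag
            begin
              if Data.Eof then
              begin
                FetchNewRecords(Data);         //hypothetical refresh that appends new rows
                if Data.Eof then
                  Sleep(250);                  //nothing new yet, wait a bit
              end
              else
              begin
                RecData := TContainer.Create;
                RecData.ID := Data.Fields[0].AsLargeInt;
                RecData.Name := Data.Fields[1].AsString;
                Output.Add(RecData);
                Data.Next;
              end;
            end;
          end)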
jachguate
  • One question: will this block the main thread, i.e. the GUI? Because the example in the OTL blocks the GUI. :) – user299323 Mar 10 '13 at 08:11
  • In this example I needed it to _block_, because otherwise the application ends before the real work even begins. But you can change that, and the change is trivial when you know what you're doing. – jachguate Mar 10 '13 at 08:13
  • Hint: The `Run` method returns immediately. – jachguate Mar 10 '13 at 08:34
  • Is the data read directly from Dataset or copied? – user299323 Mar 10 '13 at 09:00
  • Hey, you have the (commented) code right here! Why do you ask such a question? Do you understand the code? Did you read the comments? – jachguate Mar 10 '13 at 09:04
  • I didn't use Pipeline before :) – user299323 Mar 10 '13 at 09:07
  • Well, the Pipeline itself doesn't copy anything. Just passes the values using the queues between the stages. If you have the book, go and read the 3.10 section. All you want to know is there. – jachguate Mar 10 '13 at 09:11
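Following up on the comments about blocking the GUI, here is a minimal non-blocking sketch, assuming a VCL form with a TTimer. FPipeline (an IOmniPipeline field on the form), Timer1, and the named stage routines are illustrative stand-ins for the anonymous stage procedures from the console example; only Run, Output and TryTake come from OmniThreadLibrary itself.

//sketch only: run the same pipeline without blocking the main thread
//assumed uses: OtlCommon, OtlCollections, OtlParallel, Vcl.Dialogs, System.SysUtils
procedure TForm1.StartProcessing;
begin
  FPipeline := Parallel.Pipeline
    .Stage(ReadRecordsStage)                 //same body as the first stage above
    .Stage(ProcessRecordStage).NumTasks(7)   //same body as the second stage above
    .Stage(CountRecordsStage)                //same body as the last stage above
    .Run;                                    //Run returns immediately
  Timer1.Enabled := True;
end;

procedure TForm1.Timer1Timer(Sender: TObject);
var
  Value: TOmniValue;
begin
  //poll the output queue; a zero timeout means the main thread never waits
  if FPipeline.Output.TryTake(Value, 0) then
  begin
    Timer1.Enabled := False;
    ShowMessage(Format('%d records processed', [Value.AsInteger]));
  end;
end;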