
Take for example the following code:

for i := (myStringList.Count - 1) DownTo 0 do begin
  dataList := SplitString(myStringList[i], #9);
  x := StrToFloat(dataList[0]);
  y := StrToFloat(dataList[1]);
  z := StrToFloat(dataList[2]);
  //Do something with these variables
  myOutputRecordArray[i] := {SomeFunctionOf}(x,y,z);
  //Free Used List Item 
  myStringList.Delete(i);
end;
//Free Memory
myStringList.Free;

How would you parallelise this using, for example, the OmniThreadLibrary? Is it possible? Or does it need to be restructured?

I'm calling myStringList.Delete(i); at each iteration because the StringList is large, and freeing each item immediately after use is important to minimise memory usage.

Trojanian
  • Though *"freeing items after use at each iteration is important"* for you, it may have a performance impact. If you're thinking of threads, that string list will need to be locked for each access (by a critical section), which naturally leads to processing the items in batches (several lines per batch). But I'm talking just about the workflow you described (a string list from which you'll be deleting processed items). – TLama Nov 25 '14 at 22:46
  • Are you quite sure that this is a bottleneck and that parallel execution in threads is the solution? If memory use is the problem then your solution won't help. You already have an instant in time when all the strings are in memory. Almost certainly your proposed solution won't help. Tell us about the problem. – David Heffernan Nov 25 '14 at 23:20
  • @DavidHeffernan If `{SomeFunctionOf}` returns an object then memory could increase during the processing. I would agree though that we need more information about the problem as there are more than likely going to be a lot better and smarter solutions. – Graymatter Nov 25 '14 at 23:32
  • OK, let's ask the obvious question first. If execution halts halfway through and the process hangs when using Graymatter's solution, did it do the same before you implemented Graymatter's solution? – Mason Wheeler Nov 26 '14 at 01:24
  • @MasonWheeler: Do you mean does it still hang if I omit the line `myStringList[i] := '';`? You can see it doesn't hang in the sequential implementation. – Trojanian Nov 26 '14 at 01:28
  • Sure, but that was never in question. Does it hang in the parallel implementation if you don't try to do anything like this and leave all cleanup until the end? You have to establish the baseline before you can determine if the changes broke anything. – Mason Wheeler Nov 26 '14 at 01:39
  • @MasonWheeler: The problem in the parallel implementation seems to be with the line `dataList := SplitString(myStringList[i], #9);`. – Trojanian Nov 26 '14 at 01:54
  • Your parallel implementation is badly flawed. Random is not threadsafe. Is TAppender? Shared data kills scaling, as Mason says. I still doubt that your proposed solution can help. By a distance the disk IO is the bottleneck. Serial code will be the fastest. – David Heffernan Nov 26 '14 at 07:15
  • @Trojanian If the parallel version doesn't finish then you are getting an exception in your code. Your edit changes the question quite a lot. It's no longer a function returning a result and setting an array index; it's now quite different. I tested the block of code with the myOutputRecordArray line commented out and it worked as expected. – Graymatter Nov 26 '14 at 07:16
  • To make matters worse, your parallel code is different from mine. You have not declared the variables inside the threaded block as I explained. You are using local variables for dataList, x, y and z. That will cause an access violation because you are writing to the same variables from multiple threads. – Graymatter Nov 26 '14 at 07:22
  • @Graymatter - You're right. I didn't realise I had introduced errors into the example, sorry. I've deleted the edit. Coding it in your way does work but, interestingly, it doesn't seem to actually run in parallel. – Trojanian Nov 26 '14 at 15:51
  • That's because your bottleneck is the IO as @DavidHeffernan said. – Graymatter Nov 26 '14 at 16:39
  • @Graymatter Not sure about that. When reading the file I'd expect very low CPU because the IO will block and therefore free the CPU. My guess is that asker really is seeing 25% CPU during that parallel for. If the list is large enough then that parallel for, the way you wrote it, should benefit from threads. But my point is that with a pipeline you just don't need any threads and can get the job done even faster. – David Heffernan Nov 26 '14 at 16:41
  • @DavidHeffernan The only other thing I can think of is that the TAppender is preventing any meaningful threading benefits. – Graymatter Nov 26 '14 at 16:48
  • @Graymatter: The 25% CPU usage is in the parallel for loop and I didn't use the `TAppender` when I last tested it - I ran it as you did, with the myOutputRecordArray commented out. The list should definitely be long enough to benefit from threads, it's 20 million lines long. – Trojanian Nov 26 '14 at 17:02
  • Ok, I know what the problem is. I will update my answer in about an hour. The memory manager is actually the "shared state" that Mason was referring to so it's forcing the threads to synchronise. We can work around it. – Graymatter Nov 26 '14 at 17:16
  • @Graymatter did you find a way around this? – Trojanian Nov 30 '14 at 17:32

3 Answers


Simple answer: You wouldn't.

More involved answer: The last thing you want to do in a parallelized operation is modify shared state, such as this delete call. Since it's not guaranteed that each individual task will finish "in order"--and in fact it's highly likely that they won't at least once, with that probability approaching 100% very quickly the more tasks you add to the total workload--trying to do something like that is playing with fire.

You can either destroy the items as you go and run serially, or run in parallel, finish faster, and destroy the whole list at the end. But I don't think there's any way to have it both ways.

Mason Wheeler

You can cheat. Setting the string value to an empty string will free most of the memory and will be thread safe. At the end of the processing you can then clear the list.

Parallel.ForEach(0, myStringList.Count - 1).Execute(
  procedure (const index: integer)
  var
    dataList: TStringDynArray;
    x, y, z: Single;
  begin
    dataList := SplitString(myStringList[index], #9);
    x := StrToFloat(dataList[0]);
    y := StrToFloat(dataList[1]);
    z := StrToFloat(dataList[2]);
    //Do something with these variables
    myOutputRecordArray[index] := {SomeFunctionOf}(x,y,z);
    //Free Used List Item
    myStringList[index] := '';
  end);
myStringList.Clear;

This code is safe because we are never writing to a shared object from multiple threads. You need to make sure that all of the variables you use that would normally be local are declared in the threaded block.

Graymatter
  • Thanks for this. This code runs but it doesn't seem to run in parallel. It runs with the same processor usage (~25% in my quad core) as the sequential implementation above. – Trojanian Nov 26 '14 at 15:53

I'm not going to attempt to show how to do what you originally asked, because it is a bad idea that will not lead to improved performance, even assuming you deal with the many and various data races in your proposed parallel implementation.

The bottleneck here is the disk I/O. Reading the entire file into memory and then processing the contents is the design choice that is leading to your memory problems. The correct way to solve this problem is to use a pipeline.

Step 1 of the pipeline takes as input the file on disk. The code here reads chunks of the file and then breaks those chunks into lines. These lines are the output of this step. The entire file is never in memory at one time. You'll have to tune the size of the chunks that you read.

Step 2 takes as input the strings the step 1 produced. Step 2 consumes those strings and produces vectors. Those vectors are added to your vector list.

Step 2 will be faster than step 1 because I/O is so expensive. Therefore there's nothing to be gained by trying to optimise either step with parallel algorithms. Even on a uniprocessor machine, this pipelined implementation could be faster than the non-pipelined one.
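To make the two steps concrete, here is a minimal sketch of how such a pipeline might look with OmniThreadLibrary's Parallel.Pipeline. This is not from the question: the record type TVector, the stage routines ReadLines and ParseLines, the file name 'data.txt', and the global VectorList are all illustrative assumptions, and the exact OTL overloads should be checked against your library version.

```pascal
program PipelineSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.StrUtils, System.Types,
  System.Generics.Collections,
  OtlCommon, OtlCollections, OtlParallel;

type
  TVector = record
    x, y, z: Single;
  end;

var
  VectorList: TList<TVector>; // illustrative destination for the parsed vectors

// Stage 1: read the file line by line. The entire file is never in memory;
// each line is handed to the next stage through a blocking collection.
procedure ReadLines(const input, output: IOmniBlockingCollection);
var
  reader: TStreamReader;
begin
  reader := TStreamReader.Create('data.txt'); // illustrative file name
  try
    while not reader.EndOfStream do
      output.Add(reader.ReadLine);
  finally
    reader.Free;
  end;
end;

// Stage 2: parse each tab-separated line into a vector. Each stage runs in
// its own thread by default, so appending to VectorList from here (and only
// here) involves no shared writes between threads.
procedure ParseLines(const input, output: IOmniBlockingCollection);
var
  value: TOmniValue;
  parts: TStringDynArray;
  vec: TVector;
begin
  while input.Take(value) do begin
    parts := SplitString(value, #9);
    vec.x := StrToFloat(parts[0]);
    vec.y := StrToFloat(parts[1]);
    vec.z := StrToFloat(parts[2]);
    VectorList.Add(vec);
  end;
end;

var
  pipeline: IOmniPipeline;
begin
  VectorList := TList<TVector>.Create;
  try
    pipeline := Parallel.Pipeline
      .Stage(ReadLines)
      .Stage(ParseLines)
      .Run;
    pipeline.WaitFor(INFINITE);
  finally
    VectorList.Free;
  end;
end.
```

Because stage 1 blocks on I/O while stage 2 parses the previous chunk of lines, the two stages overlap naturally without any explicit locking, which is the point of the pipeline design.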

David Heffernan