
This is pretty straightforward:

Say I have many files in the folder data/ to process via some executable ./proc. What is the simplest way to maximize efficiency? I have been doing this to gain some efficiency:

ls --sort=size data/* | tac | parallel ./proc

This lists the files in data/ by size (largest first), then tac (the reverse of cat) flips that order so the smallest files are processed first. Is this the most efficient solution? If not, how can the efficiency be improved (simple solutions preferred)?

I remember that sorting like this leads to better efficiency since larger jobs don't block up the pipeline, but aside from examples I can't find or remember any theory behind this, so any references would be greatly appreciated!

Marat Dukhan
Matthew Turner
  • I feel like you are confusing `SIMD` with `SPMD` – Marat Dukhan Apr 14 '14 at 19:16
  • I also think you are confusing "efficiency" with "the illusion of progress". If processing N jobs takes M seconds by processing the smallest ones first, it is no more or less efficient than if it takes M seconds processing the biggest ones first. What is different is that more progress appears to be being made at the start. The long jobs still take just as long to finish, though, on an individual basis. GNU parallel does a reasonably good job keeping a specified number of processing streams busy in whatever order you give it... – twalberg Apr 14 '14 at 19:37
  • @twalberg Right on. I'm not confused, though. I am open to the answer being "it doesn't matter, so no, processing the smallest ones first is not the most efficient way". – Matthew Turner Apr 14 '14 at 20:03

1 Answer


If you need to run all jobs and want to minimize the total time to complete them, you want all CPUs to finish at about the same time. In that case you should run the big jobs first and the small jobs last. Otherwise you may end up with every CPU done except one that has just started on the last big job, and all the other CPUs sit idle while it finishes.
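For the pipeline in the question, running the big jobs first can be as simple as dropping tac, since ls --sort=size already lists the largest files first. Below is a minimal sketch of the same idea that avoids parsing ls output. It assumes GNU find (for -printf and -maxdepth), and largest_first is a hypothetical helper name, not something GNU parallel provides. File names containing newlines are not handled.

```shell
# largest_first DIR
# Print the regular files directly inside DIR, one per line, largest first.
# Assumes GNU find; "largest_first" is a made-up helper name.
largest_first() {
  find "$1" -maxdepth 1 -type f -printf '%s\t%p\n' |  # "size<TAB>path" per file
    sort -rn |                                        # biggest size first
    cut -f2-                                          # drop the size column
}

# Usage with the question's executable (needs GNU parallel installed):
# largest_first data | parallel ./proc
```

File size is only a stand-in for run time here; if ./proc's run time does not grow with input size, this ordering may not help.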

Here are 8 jobs: 7 take 1 second, one takes 5:

1 2 3 4 55555 6 7 8

On a dual core, small jobs first (one row per core, one character per second; done after 8 seconds):

1368
24755555

On a dual core, big jobs first (one row per core, one character per second; done after 6 seconds):

555557
123468
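
The two schedules above can be reproduced with a small simulation. This is only a sketch under a simplifying assumption: each job, in the order given, goes to whichever core becomes free first, which is roughly how a fixed number of job slots gets filled. makespan is a hypothetical helper name, and the durations are the ones from the example.

```shell
# makespan CORES DURATION...
# Hand each job, in the order given, to the core that is free earliest,
# then print the time at which the last core finishes.
# "makespan" is a made-up helper name for this illustration.
makespan() {
  cores=$1; shift
  printf '%s\n' "$@" | awk -v cores="$cores" '
    {
      # find the core with the least work so far (ties go to the lowest index)
      min = 1
      for (i = 2; i <= cores; i++)
        if (busy[i] + 0 < busy[min] + 0) min = i
      busy[min] += $1
    }
    END {
      max = 0
      for (i = 1; i <= cores; i++)
        if (busy[i] + 0 > max) max = busy[i]
      print max
    }'
}

makespan 2 1 1 1 1 1 1 1 5   # small jobs first: prints 8
makespan 2 5 1 1 1 1 1 1 1   # big job first:    prints 6
```

The total work is 12 seconds in both cases, so 6 seconds is the best two cores can do; the small-jobs-first order loses 4 core-seconds to idling at the end.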
Ole Tange
  • Yes! Thank you. To be honest I don't know where I remembered that smaller ones should be first, and I was really looking for an answer like this. – Matthew Turner Apr 14 '14 at 22:46