
I have a program that generates lots (terabytes) of output and sends it to stdout.

I want to split that output and process it in parallel with a bunch of instances of another program. The data can be distributed in any way, as long as lines are left intact.

Parallel can do this, but it takes a fixed number of lines and restarts the filter process after each chunk:

./relgen | parallel -l 100000 -j 32 --spreadstdin ./filter

Is there a way to keep a constant number of processes running and distribute data among them?

Craden

1 Answer


-l is bad for performance: it forces parallel to count lines in the input. Use --block instead if possible; it reads fixed-size chunks and still splits only at line boundaries, so lines stay intact.

You can have the data distributed round robin among the running jobs with --roundrobin.

./relgen | parallel --block 3M --roundrobin -j 32 --pipe ./filter
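A quick way to sanity-check that lines survive the split and that all workers stay busy (a minimal sketch; seq and wc -l stand in for relgen and filter from the question, and the block size and job count are arbitrary):

seq 10000000 | parallel --roundrobin --pipe --block 3M -j 4 wc -l

Each of the 4 wc -l instances stays running for the whole run and prints one count at the end; the counts sum to 10000000, showing every line was delivered intact to exactly one worker. Larger --block values reduce scheduling overhead at the cost of coarser load balancing.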
Ole Tange