
Using --pipe -N<int> I can send a given number of lines as input to a job started by parallel. But how can I run several jobs, each with a different argument given via :::, on each chunk?
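
For illustration, the chunking part alone works as I expect; each job receives two lines on stdin:

$ seq 6 | parallel --pipe -N2 wc -l
2
2
2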

Let's take this little input file:

A   B   C
D   E   F
G   H   I
J   K   L

Furthermore, let's send every two lines to a parallel job. On each chunk, the command cut -f<int> should be executed, with the column numbers given as input arguments to parallel, like ::: {1..3}.

So for the given example the output would look like this:

A
D
B
E
C
F
G
J
H
K
I
L

I've tried this command:

cat input.txt|parallel --pipe -N2 'cut -f{1}' ::: {1..3}

But the output is this:

A
D
I
L

What am I missing?

fin swimmer


1 Answer


This:

cat input.txt|parallel --pipe -N2 'cut -f{1}' ::: {1..3}

reads 2 records from each input source: 2 lines from stdin and 2 arguments from :::. It becomes clearer if you add -v:

$ cat input.txt|parallel --pipe -v -N2 'cut -f{}' ::: {1..3}
cut -f1  -f2
cut: only one type of list may be specified
Try 'cut --help' for more information.
cut -f3
I
L

GNU Parallel pairs each argument with a block. What you are looking for is more like --tee, where every block is sent to every command. --tee, however, does not chop the input into blocks; it sends all input to each command. So maybe we can combine the two:

doit() { parallel --pipe -N2 -v cut -f$@; }
export -f doit
cat input.txt|parallel --pipe --tee -v doit {} ::: {1..3}

Or you can flip the order (this is probably less efficient, as it starts a new inner parallel for every chunk):

doit() { parallel -v --pipe --tee cut -f{} ::: {1..3}; }
export -f doit
cat input.txt|parallel --pipe -N2 -v doit

Remove -v when you are happy with what is being run.
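
When you do, the first variant becomes:

doit() { parallel --pipe -N2 cut -f$@; }
export -f doit
cat input.txt|parallel --pipe --tee doit {} ::: {1..3}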

--tee is very efficient (1-2 GBytes/s with --pipe, 2-3 GBytes/s with --pipepart), but it has the disadvantage that it starts all jobs in parallel: if you have 10000 values instead of {1..3}, it will start 10000 processes.
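
Since the input is a file here, you can also use --pipepart instead of --pipe on the outer parallel. A sketch (with doit defined as above; --pipepart reads the file given with -a directly instead of stdin):

parallel --pipepart -a input.txt --tee doit {} ::: {1..3}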

Ole Tange