
I'm evaluating whether GNU Parallel can be used to search files stored on a system in parallel. There can be only one file for each day of year (doy) on the system (so a maximum of 366 files per year). Let's say there are 3660 files on the system (about 10 years' worth of data). The system could be a multi-CPU, multi-core Linux or a multi-CPU Solaris.

I'm storing the search commands to run on the files in an array (one command per file). This is what I'm doing right now (in bash), but it gives me no control over how many searches run in parallel (I definitely don't want to start all 3660 searches at once):

#!/usr/bin/env bash
declare -a cmds
declare -i cmd_ctr=0

while [[ <condition> ]]; do
    # Queue a search only if the day-of-year log file exists and is non-empty
    if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
      cmds[$cmd_ctr]="<cmd_to_run>"
      (( cmd_ctr++ ))
    fi
done

declare -i arr_len=${#cmds[@]}
for (( i=0; i<arr_len; i++ )); do
  # Get the command and run it in the background
  eval "${cmds[$i]}" &
done
wait

If I were to use parallel (which automatically figures out the number of CPUs/cores and starts only that many searches at once), how can I reuse the cmds array with parallel and rewrite the above code? The other alternative is to write all the commands to a file and then do cat cmd_file | parallel.

Say No To Censorship
  • To be pedantic, in my universe 10 years cannot yield 3660 files since there cannot be 10 consecutive leap years. But since you wrote "about" I assume you know that and don't glance into mine from a parallel universe (which saddens me a bit) ;-) – Adrian Frühwirth May 07 '13 at 22:22
  • @Adrian You are right; I added 'about' to account for leap years :) – Say No To Censorship May 07 '13 at 22:43

1 Answer


https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Using-shell-variables says:

parallel echo ::: "${V[@]}"

You do not want the echo, so:

parallel ::: "${cmds[@]}"

If you do not need $cmds for anything else, then use sem (an alias for parallel --semaphore): https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Working-as-mutex-and-counting-semaphore

while [[ <condition> ]]; do
  if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
    sem -j+0 <cmd_to_run>
  fi
done
sem --wait
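
Here -j+0 means "as many parallel jobs as there are CPU cores". A concrete sketch of the same loop, with a hypothetical grep standing in for <cmd_to_run> (the ERROR search string and the 3660-day range are illustrative assumptions, not from the question):

for (( doy_ctr=1; doy_ctr<=3660; doy_ctr++ )); do
  if [[ -s $cur_archive_path/log.${doy_ctr} ]]; then
    # Queue one search; sem blocks here whenever all CPU cores are busy.
    sem -j+0 grep -c ERROR "$cur_archive_path/log.${doy_ctr}"
  fi
done
sem --wait   # block until every queued search has finished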

You have not described what <condition> might be. If you are simply doing something like a for loop, you could replace the whole script with:

parallel 'if [ -s {} ] ; then cmd_to_run {}; fi' ::: $cur_archive_path/log.{1..3660}

(based on https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Composed-commands).
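
If you also want an explicit ceiling on concurrency (the original concern about starting all 3660 searches at once), -j accepts an absolute number or a percentage of the CPU cores; the grep and the 50% figure below are illustrative assumptions:

# Run at most one search per half of the available cores; grep stands in
# for the real search command.
parallel -j 50% 'if [ -s {} ] ; then grep -H ERROR {}; fi' ::: $cur_archive_path/log.{1..3660}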

Ole Tange
  • Thanks for all the tips. But I have a feeling the shell will blow up if "${cmds[@]}" is expanded in-line, especially if the cmds array has 1000 elements/commands in it? Do you think it's safer to feed the commands from a file? – Say No To Censorship May 15 '13 at 20:07
  • Also, when `${cmds[@]}` is expanded, what is the delimiter between multiple commands (do I need to use a `;` at the end of every command)? How is this different from `cat cmd_file | parallel`, where I suppose a newline character is considered the command separator? – Say No To Censorship May 15 '13 at 20:25
  • I can run command lines of 130KB, so if your command is < 130 char, you should be safe. But personally I would simply either pipe the commands to parallel (thus avoiding both any shell limit and a temporary file) or let parallel generate the commands. – Ole Tange May 16 '13 at 08:27
  • The delimiter is the array element. So you can put anything in each element - no need to end with a ;. The primary difference is the command line length (which you clearly are aware of). But why not just try it out: Run some 'echo' commands that are of similar length and content as your real commands. That should make you feel more confident that it will work. – Ole Tange May 16 '13 at 08:28
  • One more question: is there a way I can limit the total # of lines output when using Parallel? E.g., each of the data files contains 30,000 records, but there are 10 of these files. I want to search all 10 files in parallel but don't want the search to return 100,000 records (assuming 10,000 records from each search), which is too much for the client to handle. I can possibly send all results to a file, then do a `head -n 30000` and tell the client where the file is located/make it available for download. Or can I simply do: `cat cmd_file | parallel -j30% | head -n 30000`? – Say No To Censorship May 21 '13 at 12:26
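
A minimal sketch of the pipe-based approach suggested in the comments above, assuming the cmds array from the question; all_results.txt, the -j30% limit, and the 30000-line cap are taken from the last comment and are illustrative only:

# Feed one command per line to parallel over a pipe: no temporary command
# file and no risk of hitting the shell's argument-length limit.
printf '%s\n' "${cmds[@]}" | parallel -j30% > all_results.txt

# Cap what is returned to the client.
head -n 30000 all_results.txt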