
I have this Bash function that I'm trying to understand, and it uses parallelism:

function get_cache_files() {
    ## The maximum number of parallel processes. 16 since the cache
    ## naming scheme is hex based.
    local max_parallel=${3-16}
    ## Get the cache files running grep in parallel for each top level
    ## cache dir.
    find $2 -maxdepth 1 -type d | xargs -P $max_parallel -n 1 grep -Rl "KEY:.*$1" | sort -u
} # get_cache_files

So my questions:

  1. The comment: "16 since the cache naming scheme is hex based" - naming example is this: php2-mindaugasb.c9.io/5c/c6/348e9a5b0e11fb6cd5948155c02cc65c - why is it important to use 16 processes when the naming scheme is HEX based (hexadecimal system)?
  2. The -P option for xargs is max-procs:

Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option with -P; otherwise chances are that only one exec will be done.

Ok, so: "xargs -P $max_parallel -n 1" is correct and 16 processes will be initiated? Or should n be equal to $max_parallel also?

  3. As I understand it, the conditions for parallelising are:

    1. Independence of resources on which the operations will be performed (like similar files on which the operations will be performed);
    2. Operations are performed on independent computers;

    What are other conditions, circumstances when you can parallelise?

UltraInstinct
Mindaugas Bernatavičius

1 Answer


Ok, so: "xargs -P $max_parallel -n 1" is correct and 16 processes will be initiated? Or should n be equal to $max_parallel also?

Think of several checkout counters in a store and a large number of customers waiting to pay. In this analogy, -P is the number of counters (here, 16) and -n is the number of customers one counter handles at a time (here, 1). In this case, it's easy to picture queues of approximately equal size at each counter, right?

From the perspective of the question, max_parallel=${3-16} means that the variable is set to 16 if the $3 argument is not passed to the function. xargs runs up to 16 (-P parameter) grep processes at the same time. Each grep invocation gets exactly one item (-n parameter) from the stdin of xargs as its last command-line argument; xargs splits its input on whitespace, so this is one line per invocation as long as the paths contain no spaces. In this case, the stdin of xargs is the output of the find command. Overall, find lists $2 and the directories directly below it (-maxdepth 1), and that list is consumed one entry at a time by grep processes, at most 16 of which run concurrently. Each grep process will be invoked as:

grep -Rl "KEY:.*$1" <one line from find-output/xargs-input>
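
To make the two mechanisms above concrete, here is a minimal sketch. The function names and the `dir0`..`dir5` inputs are made up for illustration and are not part of the original script, and `echo` stands in for `grep` so nothing is actually searched. It assumes an xargs that supports -P (GNU and BSD xargs both do).

```shell
# Hypothetical helpers for illustration; not part of the original script.

# Prints the first argument, or 16 when no argument is given at all.
# Note that ${1-16} only covers the "unset" case; an empty string is kept.
show_max_parallel() {
    local max_parallel="${1-16}"
    echo "$max_parallel"
}

# Feeds six items to xargs. Because of -n 1, echo is invoked six times,
# once per item; -P only limits how many invocations run at the same time.
# sort is there because parallel runs may finish in any order.
one_arg_per_invocation() {
    printf '%s\n' dir0 dir1 dir2 dir3 dir4 dir5 |
        xargs -P "$(show_max_parallel "$@")" -n 1 echo searching | sort
}

show_max_parallel        # no argument, so the default is used
show_max_parallel 4      # an explicit value wins over the default
one_arg_per_invocation 2 # six invocations, at most two at a time
```

So -n does not need to match -P: -n sets how much work each invocation receives, while -P sets how many invocations may be in flight at once.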

The comment: "16 since the cache naming scheme is hex based" - naming example is this: php2-mindaugasb.c9.io/5c/c6/348e9a5b0e11fb6cd5948155c02cc65c - why is it important to use 16 processes when the naming scheme is HEX based (hexadecimal system)?

I cannot make out the logic behind this, but I think it has more to do with the distribution and volume of the data. If the total number of output lines from find is a multiple of 16, then it probably makes some sense.

UltraInstinct
  • I like this. One thing to note is that `max_parallel=${3-16}` essentially means that `max_parallel` is set to `$3` if passed else defaulted to `16` – iruvar Jan 02 '15 at 15:12
  • @1_CR Ohh, I missed writing about that part. Let me add it to my answer. Thanks! – UltraInstinct Jan 02 '15 at 15:13
  • Yes, I can't make out the logic behind the specific number of parallel processes either - the output is not skewed towards n*16 as far as I can see, and it uses "sort -u" at the end to eliminate duplicates. Can you maybe tell me why parallelism is used if it produces duplicates? It seems to make no sense to me – Mindaugas Bernatavičius Jan 05 '15 at 08:05
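
On the duplicates raised in the last comment: one likely cause is the find command rather than the parallelism. `find DIR -maxdepth 1 -type d` also prints DIR itself, so one `grep -R` walks the whole tree while the others walk each subdirectory again. The sketch below reproduces this in a throwaway directory; the layout, the file name and the `example` pattern are invented for the demonstration, and -mindepth is an extension that GNU and BSD find provide, so treat that part as an assumption about your system.

```shell
# Hypothetical cache layout in a temporary directory; all names are made up.
root=$(mktemp -d)
mkdir -p "$root/5c" "$root/a1"
echo "KEY: example.com/page" > "$root/5c/entry"

# As in the question: the starting directory is listed along with its children.
with_root=$(find "$root" -maxdepth 1 -type d | wc -l | tr -d ' ')

# The matching file is reported once via $root and once via $root/5c.
hits=$(find "$root" -maxdepth 1 -type d |
    xargs -n 1 grep -Rl "KEY:.*example" | wc -l | tr -d ' ')

# sort -u collapses the repeated path, which is what the original function does.
unique_hits=$(find "$root" -maxdepth 1 -type d |
    xargs -n 1 grep -Rl "KEY:.*example" | sort -u | wc -l | tr -d ' ')

# Assumption: with -mindepth 1 the starting directory is skipped.
without_root=$(find "$root" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')

echo "dirs with root: $with_root, hits: $hits, unique: $unique_hits, dirs without root: $without_root"
rm -rf "$root"
```

If that is what happens in the real cache, skipping the starting directory would avoid the duplicate work, at the cost of missing any cache files stored directly in the top-level directory.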