
Assume the following in Bash:

declare -A ar='([one]="1" [two]="2" )'
declare -a ari='([0]="one" [1]="two")'
for i in ${!ari[@]}; do 
  echo $i ${ari[i]} ${ar[${ari[i]}]}
done
This prints:

0 one 1
1 two 2

Can the same be done with GNU Parallel, making sure to use the index of the associative array, not the sequence? Does the fact that arrays can't be exported make this difficult, if not impossible?

LookAheadAtYourTypes
Larry
  • Maybe I didn't understand what you were looking for, or maybe you hit the accept button too fast. Let me know :) – rici Jul 27 '14 at 05:32
  • Rici - both answers help me parallelize the processing of the arrays, so both answers are acceptable. I didn't realize clicking on the check mark would limit the number of answers. Thanks for the illuminating answer! – Larry Jul 27 '14 at 08:09

3 Answers


Yes, it makes it trickier. But not impossible.

You can't export an array directly. However, you can turn an array into a description of itself using declare -p, and you can store that description in an exportable variable. In fact, you can store the description in a function and export the function, although it's a bit of a hack. You also have to deal with the fact that executing a declare command inside a function makes the declared variables local, so you need to insert a -g flag into the generated declare commands.
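As a minimal sketch of that serialization step (demo is a throwaway array name), declare -p emits a string that recreates the array, and a sed one-liner splices in the -g flag:

```shell
# Serialize an associative array into a plain string with declare -p.
declare -A demo=([one]="1" [two]="2")
serialized=$(declare -p demo)
echo "$serialized"

# Replayed inside a function, that declare would create a local variable,
# so splice in -g to force a global declaration instead.
echo "$serialized" | sed '1s/declare -./&g/'
```

The exact quoting of the output varies between bash versions, but it always begins with `declare -A demo=` (and `declare -Ag demo=` after the sed pass).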

UPDATE: Since shellshock, the above hack doesn't work. A small variation on the theme does work. So if your bash has been updated, please skip down to the subtitle "ShellShock Version".

So, here's one possible way of generating the exportable function:

make_importer () {
  local func=$1; shift; 
  export $func='() {
    '"$(for arr in $@; do
          declare -p $arr|sed '1s/declare -./&g/'
        done)"'
  }'
}

Now we can create our arrays and build an exported importer for them:

$ declare -A ar='([one]="1" [two]="2" )'
$ declare -a ari='([0]="one" [1]="two")'
$ make_importer ar_importer ar ari

And see what we've built

$ echo "$ar_importer"
() {
    declare -Ag ar='([one]="1" [two]="2" )'
declare -ag ari='([0]="one" [1]="two")'
  }

OK, the formatting is a bit ugly, but this isn't about whitespace. Here's the hack, though. All we've got there is an ordinary (albeit exported) variable, but when it gets imported into a subshell, a bit of magic happens [Note 1]:

$ bash -c 'echo "$ar_importer"'

$ bash -c 'type ar_importer'
ar_importer is a function
ar_importer () 
{ 
    declare -Ag ar='([one]="1" [two]="2" )';
    declare -ag ari='([0]="one" [1]="two")'
}

And it looks prettier, too. Now we can run it in the command we give to parallel:

$ printf %s\\n ${!ari[@]} |
    parallel \
      'ar_importer; echo "{}" "${ari[{}]}" "${ar[${ari[{}]}]}"'
0 one 1
1 two 2

Or, for execution on a remote machine:

$ printf %s\\n ${!ari[@]} |
    parallel -S localhost --env ar_importer \
      'ar_importer; echo "{}" "${ari[{}]}" "${ar[${ari[{}]}]}"'
0 one 1
1 two 2

ShellShock Version

Unfortunately, the flurry of fixes for shellshock makes it a little harder to accomplish the same task. In particular, it is now necessary to export a function named foo as an environment variable named BASH_FUNC_foo%%, which is an invalid shell identifier (because of the percent signs). But we can still define the function (using eval) and export it, as follows:

make_importer () {
  local func=$1; shift; 
  # An alternative to eval is:
  #    . /dev/stdin <<< ...
  # but that is neither less nor more dangerous than eval.
  eval "$func"'() {
    '"$(for arr in $@; do
          declare -p $arr|sed '1s/declare -./&g/'
        done)"'
  }'
  export -f "$func"
}

As above, we can then build the arrays and make an exporter:

$ declare -A ar='([one]="1" [two]="2" )'
$ declare -a ari='([0]="one" [1]="two")'
$ make_importer ar_importer ar ari

But now, the function actually exists in our environment as a function:

$ type ar_importer
ar_importer is a function
ar_importer () 
{ 
    declare -Ag ar='([one]="1" [two]="2" )';
    declare -ag ari='([0]="one" [1]="two")'
}
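Under the hood, the exported function lives in the environment under the mangled name described above. You can see this for yourself (probe is a throwaway function name; the exact suffix after the name depends on the patch level):

```shell
# Define and export a trivial function, then look for its mangled
# entry in the environment seen by child processes.
probe() { echo "hello from probe"; }
export -f probe
env | grep -o '^BASH_FUNC_probe[^=]*'
```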

Since it has been exported, we can run it in the command we give to parallel:

$ printf %s\\n ${!ari[@]} |
    parallel \
      'ar_importer; echo "{}" "${ari[{}]}" "${ar[${ari[{}]}]}"'
0 one 1
1 two 2

Unfortunately, it no longer works on a remote machine (at least with the version of parallel I have available) because parallel doesn't know how to export functions. If this gets fixed, the following should work:

$ printf %s\\n ${!ari[@]} |
    parallel -S localhost --env ar_importer \
      'ar_importer; echo "{}" "${ari[{}]}" "${ar[${ari[{}]}]}"'

However, there is one important caveat: you cannot export a function from a bash with the shellshock patch to a bash without the patch, or vice versa. So even if parallel gets fixed, the remote machine(s) must be running the same bash version as the local machine. (Or at least, either both or neither must have the shellshock patches.)


Note 1: The magic is that the way bash marks an exported variable as a function is that the exported value starts exactly with () {. So if you export a variable which starts with those characters and is a syntactically correct function, then bash subshells will treat it as a function. (Don't expect non-bash subshells to understand, though.)
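On a patched bash you can confirm that the magic no longer fires for plain variables, which is exactly why the original hack broke (not_a_func is a throwaway name):

```shell
# Export an ordinary variable whose value merely looks like a function body.
export not_a_func='() { echo surprise; }'

# A patched child shell treats it as a plain string, not a function.
bash -c 'type not_a_func >/dev/null 2>&1 \
           && echo "imported as a function" \
           || echo "just a variable"'
# prints "just a variable" on a post-shellshock bash
```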

rici
  • I have never seen the `make_importer` func before. Can I use it in GNU Parallel's documentation? – Ole Tange Jul 27 '14 at 07:59
  • One question - the -exp argument to parallel - is that in long form `--eof=xp`? Doesn't seem like it - is it supposed to be `--env`? – Larry Jul 27 '14 at 08:29
  • Do try running parallel with '-vv -S localhost --env ar_importer'. How many \ do you see :-) – Ole Tange Jul 27 '14 at 08:59
  • @OleTange: Yep, `-exp` was a typo. Shows my inexperience with parallel. Fixed, thanks. You're certainly free to reproduce the hack. You can even give me credit; I think I've committed enough hacks in my life that one more won't embarrass me too much. I've never used it with parallel, but I sometimes need to export arrays into subshells; I don't think I've ever published it before. Obviously you only want to run ar_importer once in a target shell, which requires a deeper understanding of parallel options than I possess, but I'm sure you'll have no trouble optimizing it. – rici Jul 27 '14 at 16:51
  • It doesn't seem possible, but then effectively exporting arrays _readonly_ (if I understand correctly) didn't seem possible, but is there any chance another hack may allow the associative array to be updated and saved in the original associative array? I'm guessing the only way would be to have parallel emit assignment statements that are eval'd afterwards. Thanks very much for your answers! – Larry Jul 27 '14 at 18:26
  • @Larry: as far as I know, bash doesn't provide any mechanism to trigger an action when an associative array is modified. But if you could solve that problem, then you could distribute the modifications as you suggest. Another possibility would be to insert the `declare` statements into a file instead of creating the function, and distribute the file using an rsync-like mechanism. In general, the shell approach to inheriting variables is read-only and based on copying. I've occasionally (mis)used technologies like memcached and redis to keep shells in sync with each other. – rici Jul 27 '14 at 18:40
  • @OleTange: I ended up with this: `printf %s\\n ${!ari[@]} | parallel -S localhost -mj4 --xargs --env ar_import bash -c "'"'ar_import; for i in "${@}"; do echo "$i" "${ari[$i]}" "${ar[${ari[$i]}]}"; done'"'" _` which works up to about 600 elements in the arrays. After that, it runs out of room for the command-line, which is about 3.5 times as large as the dumped arrays, thanks in part to all those backslashes you mentioned but also because it seems to be sending the environment variable twice. So it's not terribly scalable -- a file would be better -- but it's possible. – rici Jul 27 '14 at 22:52
  • GNU Parallel sends the environment variables twice as it cannot know in advance if the receiving shell is (t)csh or (ba)sh. Theoretical best case would be 2*3.5*600 = 4200. The difference between 600 and 4200 is so small that GNU Parallel will not be changed to increase the 600. Especially since the typical use of GNU Parallel is to process one arg per job and not 600 per job (i.e. remove --xargs and the 'bash -c for-loop'). – Ole Tange Jul 27 '14 at 23:08
  • The hack will be mentioned in the CREDITS file. – Ole Tange Jul 27 '14 at 23:10
  • This (make_importer), of course, no longer works in bash after the shellshock fix. Any new way to do this would be appreciated. – Larry Oct 31 '14 at 21:44
  • @Larry: Ok, I added a version which works locally with an de-shellshocked bash. I'll have to see if there is a new parallel or if I can suggest a patch to parallel to allow it to work on remote machines as well. I'm trying hard to avoid expressing an opinion about the shellshock fix. – rici Nov 01 '14 at 20:11
  • A further developed version of this hack is today an integral part of env_parallel (a wrapper for GNU Parallel that exports the full environment). – Ole Tange Mar 29 '16 at 07:06

GNU Parallel is a Perl program. If the Perl program cannot access the variables, I do not see a way for it to pass them on.

So if you want to parallelize the loop, I see two options:

declare -A ar='([one]="1" [two]="2" )'
declare -a ari='([0]="one" [1]="two")'
for i in ${!ari[@]}; do 
  sem -j+0 echo $i ${ari[i]} ${ar[${ari[i]}]}
done

The sem solution will not guard against mixed output: lines from different jobs may interleave.

declare -A ar='([one]="1" [two]="2" )'
declare -a ari='([0]="one" [1]="two")'
for i in ${!ari[@]}; do 
  echo echo $i ${ari[i]} ${ar[${ari[i]}]}
done | parallel
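To see what this second approach actually hands to parallel, run the loop without the pipe. Each emitted line is a complete command with the values already substituted, so nothing needs to be exported at all:

```shell
# Same arrays, written in plain (non-string) declare syntax.
declare -A ar=([one]="1" [two]="2")
declare -a ari=([0]="one" [1]="two")

# Each iteration prints a ready-to-run command line.
for i in "${!ari[@]}"; do
  echo echo "$i" "${ari[i]}" "${ar[${ari[i]}]}"
done
# prints:
#   echo 0 one 1
#   echo 1 two 2
```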
Ole Tange

A lot has happened in 4 years. GNU Parallel 20190222 comes with env_parallel: a shell function that makes it possible to export most of the environment to the commands run by GNU Parallel.

It is supported in ash, bash, csh, dash, fish, ksh, mksh, pdksh, sh, tcsh, and zsh. The support varies from shell to shell (see details on https://www.gnu.org/software/parallel/env_parallel.html). For bash you would do:

# Load the env_parallel function
. `which env_parallel.bash`
# Ignore variables currently defined
env_parallel --session
[... define your arrays, functions, aliases, and variables here ...]
env_parallel my_command ::: values
# The environment is also exported to remote systems if they use the same shell
(echo value1; echo value2) | env_parallel -Sserver1,server2 my_command
# Optional cleanup
env_parallel --end-session

So in your case something like this:

env_parallel --session
declare -A ar='([one]="1" [two]="2" )'
declare -a ari='([0]="one" [1]="two")'
foo() {
  for i in ${!ari[@]}; do 
    echo $i ${ari[i]} ${ar[${ari[i]}]}
  done;
}
env_parallel foo ::: dummy
env_parallel --end-session

As you might expect, env_parallel is a bit slower than pure parallel.

Ole Tange
  • Wait, what? I want parallel to do the loop, not have parallel (or env_parallel) just run a function that does the loop. This works too: env_parallel 'echo {} ${ari[{}]} ${ar[${ari[{}]}]}' ::: ${!ari[@]} – Larry Mar 26 '19 at 02:56