Process groups of file pairs from multiple directories

Question

I have some .txt files in dir1:

file_name_FOO31101.txt
file_name_FOO31102.txt
file_name_FOO31103.txt
file_name_FOO31104.txt

and some related foo.txt files in dir2:

file_name_FOO31101_foo.txt
file_name_FOO31102_foo.txt
file_name_FOO31103_foo.txt
file_name_FOO31104_foo.txt

I ultimately want to be able to call a program for pairs of files such that:

Iteration 1

program_call \
    --txt file_name_FOO31101.txt,file_name_FOO31102.txt \
    --foo file_name_FOO31101_foo.txt,file_name_FOO31102_foo.txt \
    --bar file_name_FOO31101_bar.txt,file_name_FOO31102_bar.txt

Iteration 2

program_call \
    --txt file_name_FOO31103.txt,file_name_FOO31104.txt \
    --foo file_name_FOO31103_foo.txt,file_name_FOO31104_foo.txt \
    --bar file_name_FOO31103_bar.txt,file_name_FOO31104_bar.txt

I.e.
file_name_FOO31101.txt,file_name_FOO31102.txt
file_name_FOO31103.txt,file_name_FOO31104.txt
but not
file_name_FOO31102.txt,file_name_FOO31103.txt

An answer from a question I posted yesterday got me started:

#!/bin/bash

txt_files=/path/to/txt
foo_files=/path/to/foo/files

set -- "$txt_files"/*.txt

[[ -e $1 || -L $1 ]] || { echo "No .txt files found in $txt_files" >&2; exit 1; }

# $# = number of positional parameters (the files set by "set --" above)
while (( $# > 1 )); do

  stem=$(basename "${1}" )
  output_base=$(echo "$stem" | cut -d '_' -f 1,2,3) # split on '_' and save ID

  echo "-> Processing pairs of txt files : $1,$2"

  # Add files to array (quoted so paths with spaces stay intact)
  txt1+=("$1")
  txt2+=("$2")

  shift; shift

done

(( $# )) && echo "Left over file $1 still exists"

And then (not knowing a better way of doing this) I repeat the same loop for the foo files in dir2:

set -- "$foo_files"/*_foo.txt

[[ -e $1 || -L $1 ]] || { echo "No foo.txt files found in $foo_files" >&2; exit 1; }

# $# = number of positional parameters (the files set by "set --" above)
while (( $# > 1 )); do

  stem=$(basename "${1}" )
  output_base=$(echo "$stem" | cut -d '_' -f 1,2,3) # split on '_' and save ID

  # Add files to array (quoted so paths with spaces stay intact)
  foo1+=("$1")
  foo2+=("$2")

  echo "-> Processing pairs of foo.txt files : $1,$2"

  shift; shift

done

(( $# )) && echo "Left over file $1 still exists"

And then iterate over one of the arrays (all must be the same length) and call program:

# Seeing as all arrays must be the same length, loop over one and print out corresponding values for others 
for ((i=0;i<${#txt1[@]};++i)); do
    printf "program_call --txt %s,%s --foo %s,%s\n" "${txt1[i]}" "${txt2[i]}" "${foo1[i]}" "${foo2[i]}" 
done

Which seems to basically work, printing:

program_call --txt /path/to/txt/file_name_FOO31101.txt,/path/to/txt/file_name_FOO31102.txt --foo /path/to/foo/files/file_name_FOO31101_foo.txt,/path/to/foo/files/file_name_FOO31102_foo.txt
program_call --txt /path/to/txt/file_name_FOO31103.txt,/path/to/txt/file_name_FOO31104.txt --foo /path/to/foo/files/file_name_FOO31103_foo.txt,/path/to/foo/files/file_name_FOO31104_foo.txt

However, I suspect that repeating the same while loop for each directory is a poor way of achieving this result, particularly if I want to add more options to my program call (e.g. file_name_FOO31101_bar.txt ...).

Is this a sensible way of going about this?

Is there a question? I see things working well – Inian, Jan 13 '17 at 10:55

score 0 · Answer 1 · answered Jan 14 '17 at 18:57

your intuition is correct: there are faster ways than bash loops and arrays.

here's how to list and sort the files in both directories:

find txt foo -type f -name "*.txt" | sort -t'/' -k2,2

output:

txt/a_0001.txt
foo/a_0001_foo.txt
txt/a_0002.txt
foo/a_0002_foo.txt
txt/a_0003.txt
foo/a_0003_foo.txt
txt/a_0004.txt
foo/a_0004_foo.txt
...
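
one caveat: the order shown above depends on the collation rules of the current locale. in plain byte order `.` sorts before `_`, so a_0001.txt comes before a_0001_foo.txt, but some UTF-8 locales may order them differently. a minimal sketch that pins the byte order with LC_ALL=C; the function name list_sorted is made up for illustration:

```shell
# list_sorted DIR...  (hypothetical helper, not part of the original answer)
# Lists *.txt files under the given directories, sorted on the second
# '/'-separated field (the file name when DIR is a single path component),
# using plain byte order so '.' sorts before '_'.
list_sorted() {
  find "$@" -type f -name "*.txt" | LC_ALL=C sort -t'/' -k2,2
}
```

usage would be `list_sorted txt foo`, run from the directory that contains txt and foo.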

next, assuming that there are no extra or missing files in either of the directories, you can put four paths on each line with awk:

find txt foo -type f -name "*.txt" | sort -t'/' -k2,2 |
  awk '{printf "%s ", $1; if (NR%4==0) printf "\n"}'

output:

txt/a_0001.txt foo/a_0001_foo.txt txt/a_0002.txt foo/a_0002_foo.txt 
txt/a_0003.txt foo/a_0003_foo.txt txt/a_0004.txt foo/a_0004_foo.txt 
txt/a_0005.txt foo/a_0005_foo.txt txt/a_0006.txt foo/a_0006_foo.txt 
...
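
as a possible alternative to the awk grouping step (a swapped-in technique, not what the pipeline above uses), POSIX paste joins every four input lines when given four `-` operands. a minimal sketch; the function name group_by_four is made up for illustration:

```shell
# group_by_four  (hypothetical helper)
# Reads lines on stdin and joins each run of four into one
# space-separated line. Unlike the awk version it leaves no trailing space.
group_by_four() {
  paste -d' ' - - - -
}
```

it would slot into the pipeline as `find txt foo -type f -name "*.txt" | sort -t'/' -k2,2 | group_by_four`. like the awk version, it assumes paths without spaces and a line count that is a multiple of four.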

next, you could use another awk to re-order them and make the command strings:

find txt foo -type f -name "*.txt" | sort -t'/' -k2,2 |
  awk '{printf "%s ", $1; if (NR%4==0) printf "\n"}' |
  awk '{print "program_call --txt "$1","$3" --foo "$2","$4}'

output:

program_call --txt txt/a_0001.txt,txt/a_0002.txt --foo foo/a_0001_foo.txt,foo/a_0002_foo.txt
program_call --txt txt/a_0003.txt,txt/a_0004.txt --foo foo/a_0003_foo.txt,foo/a_0004_foo.txt
...
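
the question also wants a --bar option. a rough sketch of how the same idea might extend to a third directory, assuming the *_bar.txt files live in a directory called bar (that name is a guess, the question does not say), that the directories are flat and given as relative names, and that no files are missing. here a single awk collects paths by directory name instead of relying on column positions; make_commands is a made-up name:

```shell
# make_commands  (hypothetical helper)
# Emits one program_call line per pair of stems, reading files from the
# txt, foo and bar directories (bar is an assumed directory name).
make_commands() {
  find txt foo bar -type f -name "*.txt" | LC_ALL=C sort -t'/' -k2,2 |
    awk -F/ '
      {
        # $1 is the directory name, $0 the full relative path;
        # append the path to a comma-separated list for that directory
        files[$1] = (files[$1] == "") ? $0 : (files[$1] "," $0)
      }
      NR % 6 == 0 {
        # 3 directories x 2 stems = 6 paths per command
        printf "program_call --txt %s --foo %s --bar %s\n",
               files["txt"], files["foo"], files["bar"]
        split("", files)   # portable way to empty the array
      }'
}
```

adding another directory would mean listing it in find, raising the group size to twice the number of directories, and adding one more placeholder to the printf.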

benchmark to make 500 command strings from 2000 files with fugu's code vs find|sort|awk|awk:

bash loops & arrays    10.070s
find|sort|awk|awk       0.019s

that's over 500x as fast :)

you can also save time by using pipes instead of loops to run the command strings:

find txt foo -type f -name "*.txt" | ... | sh

and usually even more time by piping commands instead to GNU parallel:

find txt foo -type f -name "*.txt" | ... | parallel

(you may have to install parallel if it's not already on your system.)
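
since piping generated text into sh executes whatever the strings contain, it can help to look at the commands before running them; file names with spaces or shell metacharacters would break or be misinterpreted. a small sketch of a preview switch, where run_commands and the DRY_RUN variable are names invented for this example:

```shell
# run_commands  (hypothetical helper)
# Reads command strings on stdin. With DRY_RUN=1 it only prints them;
# otherwise it hands them to sh for execution.
run_commands() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    cat
  else
    sh
  fi
}
```

for example, `find txt foo -type f -name "*.txt" | ... | DRY_RUN=1 run_commands` would only print the command strings, and dropping DRY_RUN=1 would run them.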
