Using Bash and process substitution, you can do it it for three files in a single (fairly long) command pipeline:
join -e 0 -a 1 -a 2 -t $'\t' -o 0,1.2,2.2 \
<(sed 1d sample001.file | sort) \
<(sed 1d sample002.file | sort) |
join -e 0 -a 1 -a 2 -t $'\t' -o 0,1.2,1.3,2.2 \
- <(sed 1d sample003.file | sort)
Note that join
requires its inputs to be sorted on the join column, which is column 1 in this case. The sed 1d
commands delete the heading line before sorting the data.
The -e0
says 'put 0
when values are missing. The -a1
and -a2
options say 'preserve all the rows from file 1 and file 2'. The -t $'\t'
option uses Bash's ANSI C Quoting to generate a tab for the separator. If you omit the -t
option, it 'works' but the output columns are separated by blanks, not tabs. The -o
options specify the columns to print: 0
is the joining column (column 1 in each file); and 1.2
is column 2 from file 1, etc. The file name -
in the second join
means 'read standard input'.
The output from that on the sample data is:
James 0 0 65
Joey 54 23 0
Kyle 87 76 34
Marge 87 0 23
Pedro 0 76 76
Sia 43 0 65
Here's some code that handles 10 sample files. I needed to generate the data too, so I used a number of tools from my toolkit to do that — random
, perturb
, range
(very similar to standard seq
) and shuffle
:
for sample in $(range -f '%03d' 1 10)
do
random -n 9 -T '%{ABCDEFJKMPS}s %[11:90]d %{BG}s' |
sort -u -k1,1 |
join -o 1.2,2.2,2.3 names - |
shuffle |
sed 's/ / /g' |
perturb -f '%2.0f' -p 10 -c 2 > "sample$sample.file"
done
One minor problem with the random data generator is it doesn't yet allow you to choose a random entry from a list of (multi-character) names, so I used a list of initials and mapped those to names with the names
file. It's a tad weird, but you should have your data already and shouldn't need to generate random data.
The file names
contained:
A Alex
B Belle
C Cynthia
D Doreen
E Elizabeth
F Ferdinand
J James
J Joey
K Kyle
M Marge
P Pedro
S Sia
For example, sample001.file
ended up containing:
Belle 81 B
Marge 62 B
Ferdinand 37 B
Sia 44 B
Doreen 45 G
Elizabeth 18 G
Joey 16 B
James 19 B
Then the joining code needs to generate the list of all names before doing any joining, else you don't see proper scores for names that don't appear in the first sample file. This doesn't use any non-standard tools.
tmp=$(mktemp ./tmp.XXXXXX)
trap 'rm -f "$tmp" "$tmp".?; exit 1' 0 1 2 3 13 15
sed 's/[[:space:]].*//' "$@" | sort -u > $tmp.0
join_cmd()
{
join -e 0 -a 1 -a 2 -o "$outcols" "$@" > "$tmp.2"
}
outcols="0,2.2"
# Generate list of all names
join_cmd "$tmp.0" <(sort "$1")
mv "$tmp.2" "$tmp.1"
shift
outcols="0,1.2,2.2"
for sample in "$@"
do
join_cmd "$tmp.1" <(sort "$sample")
sed 's/[[:space:]]\([0-9][0-9]*\)$/,\1/' "$tmp.2" > "$tmp.1"
done
# Don't hard code the output file name — do that on the command line that
# invokes this script (same as you specify the input file names on the command line).
sed 's/,/ /g' "$tmp.1" # > integrate.file
rm -f "$tmp" "$tmp".?
trap 0 1 2 3 13 15
This gets away without having to continually extend the list of joining columns by mapping the numbers into a comma-separated list.
$ column -t integrate.file
Alex 0 0 78 0 65 21 0 38 64 0
Belle 81 12 15 58 0 27 0 13 0 52
Cynthia 0 58 0 52 12 0 0 77 0 94
Doreen 45 49 0 85 0 0 57 32 81 63
Elizabeth 18 64 19 39 18 94 52 0 0 25
Ferdinand 37 0 0 0 0 64 72 21 0 28
James 19 0 0 77 0 48 78 59 39 23
Joey 16 0 0 79 0 48 78 70 39 19
Kyle 0 80 0 65 54 26 0 88 0 0
Marge 62 37 13 0 0 81 0 0 24 69
Pedro 0 0 40 0 47 74 79 0 0 0
Sia 44 0 27 0 55 0 43 0 32 0
$
You can see the contents of sample000.file
in column 2 of the output. And you can see that all the names in names
appear in the output, and there is a number for each sample file.