Joining multiple files in Linux

Question

I have the following files (underscores represent tab delimiter, filenames not included in file contents):

(sample001.file)

Name_____scores_____gender   
Joey_____54_____Boy  
Kyle_____87_____Girl  
Sia______43_____Girl  
Marge____87_____Girl

(sample002.file)

Name_____scores_____gender   
Joey_____23_____Boy  
Pedro____76_____Boy  
Kyle_____76_____Girl

(sample003.file)

Name_____scores_____gender   
Kyle_____34_____Girl  
James____65_____Boy  
Pedro____76_____Boy  
Sia______65_____Girl  
Marge____23_____Girl

I would like all of these files integrated into one, containing only first- and second-column data. It would look like this:

(integrate.file)

Name_____sample001____sample002_____sample003  
Joey_____54_____23____0  
Kyle_____87_____76____34  
Sia______43_____0_____65  
Marge____87_____0_____23  
Pedro____0______76____76  
James____0______0_____65

Basically, names should only have a single entry in the first column, and that if there is no data on any sample, it should be zero. Headers are not necessary, but they can be present.

Can anyone help me on this?

See, amongst others, [Join all files in a directory](http://stackoverflow.com/questions/3903868) and [What is the simplest method to join columns from a variable number of files?](http://stackoverflow.com/questions/18466662). — Jonathan Leffler, Jul 21 '16 at 05:06

Jonathan Leffler · Accepted Answer · 2016-07-31T17:06:47.063

Using Bash and process substitution, you can do it it for three files in a single (fairly long) command pipeline:

join -e 0 -a 1 -a 2 -t $'\t' -o 0,1.2,2.2 \
     <(sed 1d sample001.file | sort) \
     <(sed 1d sample002.file | sort) |
join -e 0 -a 1 -a 2 -t $'\t' -o 0,1.2,1.3,2.2 \
   - <(sed 1d sample003.file | sort)

Note that join requires its inputs to be sorted on the join column, which is column 1 in this case. The sed 1d commands delete the heading line before sorting the data.

The -e0 says 'put 0 when values are missing. The -a1 and -a2 options say 'preserve all the rows from file 1 and file 2'. The -t $'\t' option uses Bash's ANSI C Quoting to generate a tab for the separator. If you omit the -t option, it 'works' but the output columns are separated by blanks, not tabs. The -o options specify the columns to print: 0 is the joining column (column 1 in each file); and 1.2 is column 2 from file 1, etc. The file name - in the second join means 'read standard input'.

The output from that on the sample data is:

James   0       0       65
Joey    54      23      0
Kyle    87      76      34
Marge   87      0       23
Pedro   0       76      76
Sia     43      0       65

Here's some code that handles 10 sample files. I needed to generate the data too, so I used a number of tools from my toolkit to do that — random, perturb, range (very similar to standard seq) and shuffle:

for sample in $(range -f '%03d' 1 10)
do
    random -n 9 -T '%{ABCDEFJKMPS}s %[11:90]d   %{BG}s' |
    sort -u -k1,1 |
    join -o 1.2,2.2,2.3 names - |
    shuffle |
    sed 's/ /   /g' |
    perturb -f '%2.0f' -p 10 -c 2 > "sample$sample.file"
done

One minor problem with the random data generator is it doesn't yet allow you to choose a random entry from a list of (multi-character) names, so I used a list of initials and mapped those to names with the names file. It's a tad weird, but you should have your data already and shouldn't need to generate random data. The file names contained:

A Alex
B Belle
C Cynthia
D Doreen
E Elizabeth
F Ferdinand
J James
J Joey
K Kyle
M Marge
P Pedro
S Sia

For example, sample001.file ended up containing:

Belle   81      B
Marge   62      B
Ferdinand       37      B
Sia     44      B
Doreen  45      G
Elizabeth       18      G
Joey    16      B
James   19      B

Then the joining code needs to generate the list of all names before doing any joining, else you don't see proper scores for names that don't appear in the first sample file. This doesn't use any non-standard tools.

tmp=$(mktemp ./tmp.XXXXXX)
trap 'rm -f "$tmp" "$tmp".?; exit 1' 0 1 2 3 13 15

sed 's/[[:space:]].*//' "$@" | sort -u > $tmp.0

join_cmd()
{
    join -e 0 -a 1 -a 2 -o "$outcols" "$@" > "$tmp.2"
}

outcols="0,2.2"
# Generate list of all names
join_cmd "$tmp.0" <(sort "$1")
mv "$tmp.2" "$tmp.1"
shift
outcols="0,1.2,2.2"  

for sample in "$@"
do
    join_cmd "$tmp.1" <(sort "$sample")
    sed 's/[[:space:]]\([0-9][0-9]*\)$/,\1/' "$tmp.2" > "$tmp.1"
done

# Don't hard code the output file name — do that on the command line that
# invokes this script (same as you specify the input file names on the command line).
sed 's/,/    /g' "$tmp.1" # > integrate.file

rm -f "$tmp" "$tmp".?
trap 0 1 2 3 13 15

This gets away without having to continually extend the list of joining columns by mapping the numbers into a comma-separated list.

$ column -t integrate.file
Alex       0   0   78  0   65  21  0   38  64  0
Belle      81  12  15  58  0   27  0   13  0   52
Cynthia    0   58  0   52  12  0   0   77  0   94
Doreen     45  49  0   85  0   0   57  32  81  63
Elizabeth  18  64  19  39  18  94  52  0   0   25
Ferdinand  37  0   0   0   0   64  72  21  0   28
James      19  0   0   77  0   48  78  59  39  23
Joey       16  0   0   79  0   48  78  70  39  19
Kyle       0   80  0   65  54  26  0   88  0   0
Marge      62  37  13  0   0   81  0   0   24  69
Pedro      0   0   40  0   47  74  79  0   0   0
Sia        44  0   27  0   55  0   43  0   32  0
$

You can see the contents of sample000.file in column 2 of the output. And you can see that all the names in names appear in the output, and there is a number for each sample file.

Thank you very much. Although, I am going to merge a lot of files. How can I be able to do this? — Dale Pin, Jul 21 '16 at 05:42
See the [comment](http://stackoverflow.com/questions/38494701/joining-multiple-files-in-linux/38495499?noredirect=1#comment64390999_38494701) I made to the question — those are two questions that cover multiple files. The basic technique is join 2 files and create a new one; then join the new file and another original to create yet another file, repeating until all files are processed. Another option would be to generate a shell script analogous to the one shown in this answer, adding a new `join` for each extra data file. — Jonathan Leffler, Jul 21 '16 at 05:44
You might want to think about whether you can simplify the join processing by using, for example, blanks between the numbers and a tab only between the name and the first number. This would allow you to have the same join list in each set of commands. Or you could decide to move to Perl or Python for the whole job. — Jonathan Leffler, Jul 21 '16 at 05:48

Joining multiple files in Linux

1 Answers1