Combining many files with matching fields in particular column to a single file

Question

So I have 128 files with two columns. I want to match them by the values in the first column and add the values in the second column from each file to a single file.

I was able to kinda of find a solution here:

From: https://unix.stackexchange.com/questions/159961/merging-2-files-with-based-on-field-match

awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' file2 file1

It does what I want, however I need for this to go through every file in the folder.

Is there away to make this command loop through all the files in the folder or is there a better method all together?

Example: Input

File 1:
    gene_id normalized_count
    A1BG|1  42.3332
    A1CF|29974  165.6696
    A2BP1|54715 0.0000
    A2LD1|87769 138.1270
    A2ML1|144568    2.7612
    A2M|2   7310.6121
    A4GALT|53947    348.3663
    A4GNT|51146 0.0000


File 2:
    gene_id normalized_count
    A1BG|1  18.2019
    A1CF|29974  129.6194
    A2BP1|54715 2.2063
    A2LD1|87769 65.3116
    A2ML1|144568    0.0000
    A2M|2   3415.8632
    A4GALT|53947    83.2874
    A4GNT|51146 0.0000


File 3:
    gene_id normalized_count
    A1BG|1  8.6285
    A1CF|29974  97.6385
    A2BP1|54715 0.0000
    A2LD1|87769 200.5540
    A2ML1|144568    0.0000
    A2M|2   984.0736
    A4GALT|53947    24.0690
    A4GNT|51146 0.4541

Desired output

        gene_id normalized_count        
        A1BG|1  42.3332 18.2019 8.6285
        A1CF|29974  165.6696    129.6194    97.6385
        A2BP1|54715 0   2.2063  0
        A2LD1|87769 138.127 65.3116 200.554
        A2ML1|144568    2.7612  0   0
        A2M|2   7310.6121   3415.8632   984.0736
        A4GALT|53947    348.3663    83.2874 24.069
        A4GNT|51146 0   0   0.4541

For the desired output I don't care how the column labels end up looking.

Again my problem is that I have to do this for hundreds of files at once to produce one file.

Here are some other similar problems with solutions https://unix.stackexchange.com/questions/122919/merge-2-files-based-on-all-values-of-the-first-column-of-the-first-file

https://unix.stackexchange.com/questions/113879/how-to-merge-two-files-with-different-number-of-rows-in-shell

But they only had to do this for a few files.

Edit: both Nathan's and joepd worked and produced similar output Thank you!

Nathan's solution will produce output space delimited

joepd's will produce output that had the header (with original tab separated), and the first column separated by two spaces and the rest space delimited.

Could you provide a small sample input and desired output? Do you want the output to have an additional column for every input file? — Nathan Wilson, Jul 15 '15 at 03:12
The input and output is the same as the link to the other question. Yes each file will add a new column. So 64 files each with two columns gets collapsed into 65 columns. I can provide the exact input and desired output once I am near a computer tomorrow. — Talon, Jul 15 '15 at 04:15

joepd · Accepted Answer · 2015-07-15T17:16:16.777

0

You will need gawk for this:

gawk '{a[$1]+=$2}; END{ for (i in a) print i, a[i]}' files*

If this does not work for you, please specify input and output.

EDIT

After your specification it becomes clear that you want to concatenate the strings. How about this?

awk '
    NR==1  {title=$0}
    FNR!=1 {a[$1] = a[$1]" "$2}
    END {
        print title
        for (i in a)
            print i, a[i]
    }
' files*

edited Jul 15 '15 at 17:16

answered Jul 15 '15 at 06:35

joepd

4,681
2
26
27

I added the input and desired output. This command didn't seem to work. I tried running it on just two files and it only processes the first file and stops. – Talon Jul 15 '15 at 16:20

score 0 · Answer 2 · answered Jul 15 '15 at 17:28

This should produce the output you want with one more column in the output for each file in the input:

awk 'FNR>2{a[$1]=a[$1] " " $2}; END{ for (i in a) print i a[i]}' File*

It's structured like @joepd's answer which numerically sums the inputs instead of string concatenating them.

FNR>2 is used to ignore the header lines in each file.

Combining many files with matching fields in particular column to a single file

2 Answers2