
I tried to write a script that takes two columns from each of several files and concatenates them horizontally. The problem is that the rows are not in the same order in every file, so the data needs to be sorted before concatenating.

This is what I've come up with so far:

#!/bin/bash

ls *.txt > list

while read line; do
    awk '{print $2}' "$line" > f1
    awk '{print $8}' "$line" > f2
    paste f1 f2 | sort > "$line".output
done < list

ls *.output > list2

head -n 1 list2 > start

while read line; do
    cat "$line" > output
done < start

tail -n +2 list2 > list3

while read line; do
    paste output "$line" | cat > output
done < list3

My programming is probably not that efficient, but it does what I want it to do, with the exception of the second-to-last line, which does not concatenate the files properly. If I run that line on the command line it works fine, but inside the while loop it misses columns.
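A likely cause: in paste output "$line" | cat > output, the shell truncates output when it sets up the redirection for cat, and because both commands in the pipeline start at the same time, paste may read an already-emptied output. Writing to a temporary file first avoids this; a minimal, untested sketch using the same file names as above:

while read line; do
    # write the combined columns to a temporary file, then replace output,
    # so paste never reads the file it is about to overwrite
    paste output "$line" > output.tmp
    mv output.tmp output
done < list3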

The data files look like this:

bundle_id   target_id   length  eff_length  tot_counts  uniq_counts est_counts  eff_counts  ambig_distr_alpha   ambig_distr_beta    fpkm    fpkm_conf_low   fpkm_conf_high  solvable    tpm
1   comp165370_c0_seq1  297 0.000000    0   0   0.000000    0.000000    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    F   0.000000e+00
2   comp75418_c0_seq1   1371    852.132325  35  0   0.005490    0.008832    8.287807e-04    5.283100e+00    4.583199e-04    0.000000e+00    2.425095e-02    T   6.225299e-04
3   comp76235_c0_seq1   1371    871.645349  44  9   43.994510   69.198412   2.002884e+00    3.142003e-04    3.590738e+00    3.516301e+00    3.665174e+00    T   4.877251e+00
4   comp31034_c0_seq1   379 251.335522  14  0   7.049180    10.629771   1.000000e+00    1.000000e+00    1.995307e+00    0.000000e+00    5.957982e+00    F   2.710199e+00
5   comp36102_c0_seq1   379 234.689179  14  0   6.950820    11.224893   1.000000e+00    1.000000e+00    2.107017e+00    0.000000e+00    6.350761e+00    F   2.861933e+00
6   comp26522_c0_seq1   220 0.000000    0   0   0.000000    0.000000    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    F   0.000000e+00
7   comp122428_c0_seq1  624 0.000000    0   0   0.000000    0.000000    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    F   0.000000e+00

And I need the target_id and the eff_counts columns.

This is not the complete problem, but I thought I'd start small. Later I want the target_id to be present only once, at the beginning. And I would like the new file to have a header that contains the name of the file that contributed each column.

target_id             file_1        file_2        file_3
comp26522_c0_seq1     0.000000      [number]      [number]
comp31034_c0_seq1     10.629771     [number]      [number]
comp36102_c0_seq1     11.224893     [number]      [number]
comp75418_c0_seq1     0.008832      [number]      [number]
comp76235_c0_seq1     69.198412     [number]      [number]
comp122428_c0_seq1    0.000000      [number]      [number]
comp165370_c0_seq1    0.000000      [number]      [number]

Edit: I added more information to the examples. The [number] entries are only placeholders; in reality they would be numbers similar to those in the row under file_1. Also, the header "file_1" would be the actual name of the input file, and the target_id column should be sorted. All files should contain the same target_ids, but in a different order.
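Putting these requirements together, the shape of the transformation is: extract target_id and eff_counts from every file, sort on target_id, join the files on that key, then prepend a header of file names. An untested sketch of that shape (it uses bash arrays and process substitution; the names merged and results.txt are just placeholders):

# Untested sketch: join sorted (target_id, eff_counts) extracts on the first
# field, skipping each file's header row, then prepend the file names.
files=(*.txt)
awk 'NR>1 {print $2 "\t" $8}' "${files[0]}" | sort -k1,1 > merged
for f in "${files[@]:1}"; do
    join -t $'\t' merged <(awk 'NR>1 {print $2 "\t" $8}' "$f" | sort -k1,1) > merged.tmp
    mv merged.tmp merged
done
{ printf 'target_id'; printf '\t%s' "${files[@]}"; printf '\n'; cat merged; } > results.txt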

Edit two: output

I tested it with four files and the output looks like this:

    comp0_c0_seq1   0.000000
    comp100000_c0_seq1      1.919404
    comp100002_c0_seq1      2.118776
    comp100003_c0_seq1      0.072916
    comp100004_c0_seq1      0.000000
    comp100005_c0_seq1      0.000000
    comp100006_c0_seq1      1.548160
    comp100007_c0_seq1      7.616481
    comp100008_c0_seq1      0.000000
    comp100009_c0_seq1      1.374209

There is an empty column to the left of the first column with data, and only the data from the last file is present.

Thank you for your help!

Update:

I solved the issue I had with the second-to-last line. This is the code I used:

while read line; do
     join output "$line" > output2
     cat output2 > output
done < list3
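A slightly simpler variant of the same loop (untested) replaces the second cat with mv, so the intermediate file does not stick around:

while read line; do
    join output "$line" > output2
    mv output2 output
done < list3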

This is the output:

comp0_c0_seq1      0.000000 0.000000 0.000000 0.000000
comp100000_c0_seq1 1.919404 1.919404 0.000000 1.919404
comp100002_c0_seq1 2.118776 2.118776 2.225852 2.118776
comp100003_c0_seq1 0.072916 0.072916 1.228136 0.072916
comp100004_c0_seq1 0.000000 0.000000 0.000000 0.000000
comp100005_c0_seq1 0.000000 0.000000 1.982851 0.000000
comp100006_c0_seq1 1.548160 1.548160 1.902749 1.548160
comp100007_c0_seq1 7.616481 7.616481 0.000000 7.616481
comp100008_c0_seq1 0.000000 0.000000 0.000000 0.000000
comp100009_c0_seq1 1.374209 1.374209 1.378667 1.374209

Now I just need to figure out how to add a header with all the file names to the top of the file.
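One way to build that header might be to turn list2 into a single tab-separated line and put it above the data. An untested sketch (the suffix stripping assumes the .txt.output names from above, and results.txt is just a placeholder name):

# Untested sketch: header line built from the file list, followed by the joined data.
{
    printf 'target_id'
    while read -r f; do
        printf '\t%s' "${f%.txt.output}"   # strip the suffix, keep the original file name
    done < list2
    printf '\n'
    cat output
} > results.txt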

  • The second part is not clear. What do `23`, `0`, `100` signify? – nu11p01n73R Oct 09 '14 at 09:22
  • Those are the numbers that are in the column eff_counts. In the end they should be rounded to an integer; that's probably why I put those numbers in as an example, but in the file they are floats. It's something I thought I could easily correct in R once I have the proper input file. – Roger Huerlimann Oct 09 '14 at 09:39
  • In the desired output, is 23 the total of eff_counts in file_1 for the specific target_id? Can you give more rows of source data to get a better picture? – Kokkie Oct 09 '14 at 15:08
  • Can you show some of the present output data as well? – Travis Oct 10 '14 at 02:34
  • I don't have access to the files at the moment, but the present output is basically an empty column, followed by the data from the second file (a column with the compXXXX IDs and one with the numbers from the second file). I've only tested two input files so far. I will add the actual output as soon as possible. – Roger Huerlimann Oct 10 '14 at 04:03
  • My main issue at the moment is how to fix the second last line, so it actually horizontally concatenates all the files properly. – Roger Huerlimann Oct 12 '14 at 02:26
  • Using cat >> output in the second-to-last line seems to bring me somewhat closer to the solution, but doesn't quite solve it. It only ever concatenates two file outputs, not all of them, and each new concatenation step is placed underneath the previous one. I would have thought it would take output as an input file, add the columns of the next file, and save it in output, which now contains the data of two files; this is then used as the input for the next iteration, adding another data column so there are three columns, and so forth. – Roger Huerlimann Oct 14 '14 at 08:06
  • I think join might do the trick, but it can only join two files, while I have multiple files. Is there a way to iterate over a list of files using join to join two files at a time? – Roger Huerlimann Oct 14 '14 at 08:46

2 Answers


After a lot of reading and testing, I finally came up with a script that does exactly what I want.

It might not be the most efficient use of bash in places, but it works just fine.

# list all the .xprs result files to process
ls *.xprs > list

while read line; do
    echo "parsing $line"
    # extract target_id (column 2) and eff_counts (column 8)
    awk '{print $2}' "$line" > f1
    awk '{print $8}' "$line" > f2
    # pair the two columns, sort on target_id, and drop the header row
    # (with these IDs, the "target_id" header line sorts last)
    paste f1 f2 | sort | head -n -1 > "$line".output
done < list

ls *.output > list2

# start the combined table from the first per-file output
head -n 1 list2 > start

while read line; do
    cat "$line" > output
done < start

# join the remaining files onto it one at a time, matching rows on target_id
tail -n +2 list2 > list3

while read line; do
    join output "$line" > output2 2>/dev/null
    cat output2 > output
done < list3

# build the header line: "Contig_ID" followed by the file names, suffixes stripped
sed '1i Contig_ID' list2 | awk '{printf("%s ", $0)}' | sed -e '$a\' | sed 's/.xprs.output//g' > list4

cat list4 output > results.txt
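As an aside, the header-building step (the sed '1i Contig_ID' pipeline) could arguably be written a little more directly; a roughly equivalent, untested variant:

# Untested: prepend the label, strip the suffix, and flatten the list onto one line.
{ echo Contig_ID; cat list2; } | sed 's/\.xprs\.output$//' | paste -s -d ' ' - > list4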
  • Hi Roger, I see it works because I get the same results with my solution. – Kokkie Oct 16 '14 at 10:20
  • Hi Kokkie, Thanks for adding your solution! I'm always eager to learn new ways. Are the contig_ids matched between the files? The problem I had was that the order of the rows was different in each file, that's why I used sort and then join. I will have to go through your solution in detail and see if I can figure out how you solved the problem. – Roger Huerlimann Oct 17 '14 at 00:40

You could also begin with a collection of the filenames and the columns of interest, like below, and then transpose it using a solution like this one: Transpose CSV data with awk (pivot transformation)

find . -name "bundle*.txt" -exec awk 'NR>1 {print FILENAME,$2,$8}' {} \; | sed 's/.\//''/' > superbundle.txt

Explanation
- find all files with a name like bundle*.txt
- execute an awk statement which prints the filename and columns 2 and 8 (skipping the header)
- use sed to remove ./ from the filenames

Now we can use "superbundle.txt" and transpose it using the solution from jaypal mentioned above.

$ cat transpose.awk
{
    # remember each file name and each target_id the first time it appears,
    # preserving the order of first appearance
    if(!($1 in filenames)) { filename[++types] = $1 }; filenames[$1]++
    if(!($2 in target_ids)) { target_id[++num] = $2 }; target_ids[$2]++
    # store the value keyed by (file name, target_id)
    map[$1,$2] = $3
}
END {
    # header row: target_id followed by one column per file
    printf "%s\t" ,"target_id";
    for(ind=1; ind<=types; ind++) {
        printf "%s%s", sep, filename[ind];
        sep = "\t"
    }
    print "";
    # one row per target_id, printing the stored value for each file
    for(target=1; target<=num; target++) {
        printf "%s", target_id[target]
        for(val=1; val<=types; val++) {
            printf "%s%s", sep, map[filename[val], target_id[target]];
        }
        print ""
    }
}

The output below only shows three files, since I only created three example bundle text files.

$ awk -f transpose.awk superbundle.txt | column -t
target_id           bundle.txt  bundle2.txt  bundle3.txt
comp165370_c0_seq1  0.000000    1.000000     0.000000
comp75418_c0_seq1   0.008832    2.008832     1.008832
comp76235_c0_seq1   69.198412   3.198412     2.198412
comp31034_c0_seq1   10.629771   4.629771     3.629771
comp36102_c0_seq1   11.224893   5.224893     4.224893
comp26522_c0_seq1   0.000000    6.000000     4.000000
comp122428_c0_seq1  0.000000    7.000000     4.000000
Kokkie
  • Thank you for your input. Your solution looks much more elegant than mine. It's not quite doing what I need yet, though: the file names should go into the header, and the numbers belonging to each file should be arranged horizontally, not vertically (see the second-to-last code block in my post). Is this possible with find/awk? – Roger Huerlimann Oct 14 '14 at 07:06
  • I keep getting an error message, but would something like this be possible? find . -name "*txt.output" -exec paste {} > output – Roger Huerlimann Oct 14 '14 at 08:14
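(For what it's worth, find ... -exec paste {} + can hand all the files to a single paste call, but paste only lines rows up by position, so every file would already have to be sorted identically, and a very long file list may still be split across several paste invocations. An untested sketch:)

# Untested: paste every *.txt.output side by side in one call; only valid if
# the rows are already in the same order in every file.
find . -name "*txt.output" -exec paste {} + > output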