I am trying to write a script that takes two columns from each of several files and concatenates them horizontally. The problem is that the rows are not in the same order in every file, so the data needs to be sorted before concatenating.
This is what I've come up with so far:
#!/bin/bash
ls *.txt > list
while read line; do
awk '{print $2}' "$line" > f1
awk '{print $8}' "$line" > f2
paste f1 f2 | sort > "$line".output
done < list
ls *.output > list2
head -n 1 list2 > start
while read line; do
cat "$line" > output
done < start
tail -n +2 list2 > list3
while read line; do
paste output "$line" | cat > output
done < list3
My programming is probably not that efficient, but it does what I want it to do, with the exception of the second-to-last line, which does not concatenate the files together properly. If I enter that line on the command line it works fine, but in the while loop it misses columns.
The data files look like this:
bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low fpkm_conf_high solvable tpm
1 comp165370_c0_seq1 297 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
2 comp75418_c0_seq1 1371 852.132325 35 0 0.005490 0.008832 8.287807e-04 5.283100e+00 4.583199e-04 0.000000e+00 2.425095e-02 T 6.225299e-04
3 comp76235_c0_seq1 1371 871.645349 44 9 43.994510 69.198412 2.002884e+00 3.142003e-04 3.590738e+00 3.516301e+00 3.665174e+00 T 4.877251e+00
4 comp31034_c0_seq1 379 251.335522 14 0 7.049180 10.629771 1.000000e+00 1.000000e+00 1.995307e+00 0.000000e+00 5.957982e+00 F 2.710199e+00
5 comp36102_c0_seq1 379 234.689179 14 0 6.950820 11.224893 1.000000e+00 1.000000e+00 2.107017e+00 0.000000e+00 6.350761e+00 F 2.861933e+00
6 comp26522_c0_seq1 220 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
7 comp122428_c0_seq1 624 0.000000 0 0 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 F 0.000000e+00
And I need the target_id and the eff_counts columns.
This is not the complete problem, but I thought I'd start small. Later I want the target ID to appear only once, at the beginning of each row. I would also like a header line in the new file that names the file each column came from.
target_id file_1 file_2 file_3
comp26522_c0_seq1 0.000000 [number] [number]
comp31034_c0_seq1 10.629771 [number] [number]
comp36102_c0_seq1 11.224893 [number] [number]
comp75418_c0_seq1 0.008832 [number] [number]
comp76235_c0_seq1 69.198412 [number] [number]
comp122428_c0_seq1 0.000000 [number] [number]
comp165370_c0_seq1 0.000000 [number] [number]
Edit: I added more information to the examples. The [number] are only placeholders; in reality, they would be numbers similar to the row under file_1. Also, the header "file_1" would be the name of the input file. And the target_id should be sorted. All files should include the same target_ids, but all in a different order.
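A minimal sketch of the per-file step described above, assuming whitespace-separated input with target_id in column 2 and eff_counts in column 8. The file names and numbers here are made up for illustration; the real files have more columns. It skips the header row so that the "target_id eff_counts" line does not get sorted in among the data:

```shell
#!/bin/bash
# Hypothetical sample data in a scratch directory (not the real files).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
hdr='bundle_id target_id length eff_length tot_counts uniq_counts est_counts eff_counts'
printf '%s\n' "$hdr" \
    '1 compB 297 0.000000 0 0 0.000000 1.500000' \
    '2 compA 1371 852.132325 35 0 0.005490 0.250000' > file_1.txt
printf '%s\n' "$hdr" \
    '1 compA 1371 852.132325 35 0 0.005490 3.000000' \
    '2 compB 297 0.000000 0 0 0.000000 0.000000' > file_2.txt

for f in file_*.txt; do
    # NR > 1 skips the header row; $2 is target_id, $8 is eff_counts.
    # sort -k1,1 orders the rows by target_id only.
    awk 'NR > 1 {print $2, $8}' "$f" | sort -k1,1 > "$f.output"
done
```

Each resulting .output file then has the same target_ids in the same order, which is what a later line-by-line merge relies on.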
Edit two: output
I tested it with four files and the output looks like this:
comp0_c0_seq1 0.000000
comp100000_c0_seq1 1.919404
comp100002_c0_seq1 2.118776
comp100003_c0_seq1 0.072916
comp100004_c0_seq1 0.000000
comp100005_c0_seq1 0.000000
comp100006_c0_seq1 1.548160
comp100007_c0_seq1 7.616481
comp100008_c0_seq1 0.000000
comp100009_c0_seq1 1.374209
There is an empty column to the left of the first column with data, and only the data from the last file is present.
Thank you for your help!
Update:
I solved the issue I had with the second-to-last line. This is the code I used:
while read line; do
join output "$line" > output2
cat output2 > output
done < list3
This is the output:
comp0_c0_seq1 0.000000 0.000000 0.000000 0.000000
comp100000_c0_seq1 1.919404 1.919404 0.000000 1.919404
comp100002_c0_seq1 2.118776 2.118776 2.225852 2.118776
comp100003_c0_seq1 0.072916 0.072916 1.228136 0.072916
comp100004_c0_seq1 0.000000 0.000000 0.000000 0.000000
comp100005_c0_seq1 0.000000 0.000000 1.982851 0.000000
comp100006_c0_seq1 1.548160 1.548160 1.902749 1.548160
comp100007_c0_seq1 7.616481 7.616481 0.000000 7.616481
comp100008_c0_seq1 0.000000 0.000000 0.000000 0.000000
comp100009_c0_seq1 1.374209 1.374209 1.378667 1.374209
Now I just need to figure out how to add a header with all the file names to the top of the file.
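One possible way to add that header, as a sketch under some assumptions: the joined table is in a file called output, the per-file results are named like "input.txt.output" as in the script above, and the glob expands in the same order in which the files were joined. The sample file names, the data, and the name output_with_header are hypothetical:

```shell
#!/bin/bash
# Hypothetical sample data in a scratch directory (not the real files).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 'compA 0.250000' 'compB 1.500000' > file_1.txt.output
printf '%s\n' 'compA 3.000000' 'compB 0.000000' > file_2.txt.output
# Stand-in for the joined table produced by the join loop.
join file_1.txt.output file_2.txt.output > output

{
    printf 'target_id'
    for f in *.output; do
        # Strip the .output suffix to recover the input file name.
        printf ' %s' "${f%.output}"
    done
    printf '\n'
    # The joined data follows the header line unchanged.
    cat output
} > output_with_header
```

The header and the data are written to a new file rather than back onto output, so nothing is read from a file that is being overwritten at the same time.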