Replace tip of newick file using reference list in bash

Question

I have a collection of newick-formatted files containing gene IDs:

((gene1:1,gene2:1)100:1,gene3:1)100;
((gene4:1,gene5:1)100:1,gene6:1)100;

I have a list that maps each species name to its gene IDs:

speciesA=(gene1,gene4)
speciesB=(gene2,gene5)
speciesC=(gene3,gene6)

I would like to get the following output:

((speciesA:1,speciesB:1)100:1,speciesC:1)100;
((speciesA:1,speciesB:1)100:1,speciesC:1)100;

Any idea how I could proceed? A bash solution would be ideal :)

score 3 · Answer 1 · answered Mar 27 '15 at 15:33

Here's an awk one-liner that does what you want:

$ awk -F'[()=,]+' 'NR==FNR{a[$2]=a[$3]=$1;next}{for(i in a)gsub(i,a[i])}1' species gene
((speciesA:1,speciesB:1)100:1,speciesC:1)100;
((speciesA:1,speciesB:1)100:1,speciesC:1)100;

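The field separator `[()=,]+` splits each mapping line on parentheses, equals signs and commas, so `$1` is the species name and `$2` and `$3` are its gene IDs. `NR==FNR` is true only while awk reads the first file (`species`), because that is the only file in which the total line number `NR` equals the per-file line number `FNR`. For those lines the gene IDs are saved as keys of the array `a`, with the species name as the value, and `next` skips any further instructions. For each line of the second file (`gene`), `gsub` replaces every gene ID with its species name, and the trailing `1` prints the modified line.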
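The one-liner above reads only `$2` and `$3`, so it assumes exactly two gene IDs per species, and `gsub` matches substrings, so an ID such as `gene1` would also match inside `gene10`. The following is a minimal sketch of one way around both limits, not part of the original answer: it stores every gene ID on a mapping line and replaces only whole labels. The temporary directory, the extra `gene10` entry and the array name `map` are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: sample files are written to a temporary directory for illustration.
tmp=$(mktemp -d)

cat > "$tmp/species" <<'EOF'
speciesA=(gene1,gene4)
speciesB=(gene2,gene5)
speciesC=(gene3,gene6,gene10)
EOF

cat > "$tmp/gene" <<'EOF'
((gene1:1,gene2:1)100:1,gene3:1)100;
((gene4:1,gene5:1)100:1,gene10:1)100;
EOF

result=$(awk -F'[()=,]+' '
    # First file: map every gene ID (fields 2..NF) to the species name in field 1.
    # The closing ")" leaves an empty last field, which is skipped.
    NR == FNR {
        for (i = 2; i <= NF; i++)
            if ($i != "") map[$i] = $1
        next
    }
    # Second file: walk the line one label at a time. A label is any run of
    # characters other than the newick punctuation ( ) , : ;
    {
        out = ""
        rest = $0
        while (match(rest, /[^(),:;]+/)) {
            token = substr(rest, RSTART, RLENGTH)
            # Copy the punctuation before the label, then the label or its replacement.
            out = out substr(rest, 1, RSTART - 1) ((token in map) ? map[token] : token)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }
' "$tmp/species" "$tmp/gene")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because each label is looked up as a whole string, `gene10` maps to `speciesC` instead of being rewritten from its `gene1` prefix. Branch lengths and support values such as `1` and `100` are looked up too, so this sketch assumes no gene ID is a bare number.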

Tom Fenech


I can't accept both answers but thank you very much all the same! :) – tlorin Mar 30 '15 at 14:48
@tlorin no problem. If, for whatever reason, you intend on using the answer you have accepted, I suggest that you at least run it through http://shellcheck.net, which will pick up the numerous bad practices it contains. However I would strongly recommend that you use the right tool for the job, which is awk. If there is anything I can do to explain my answer better, please let me know. – Tom Fenech Mar 30 '15 at 15:01

Sam · Accepted Answer · 2015-03-27T15:53:19.777

-1

input.txt

((gene1:1,gene2:1)100:1,gene3:1)100;
((gene4:1,gene5:1)100:1,gene6:1)100;

equivs.txt

speciesA=(gene1,gene4)
speciesB=(gene2,gene5)
speciesC=(gene3,gene6)

convert.sh

#!/bin/bash


function replace() {
    output=$1
    for line in $(cat equivs.txt)  #this will fail if there is whitespace in your lines!
    do
        #gets the replacement string
        rep=$(echo $line | cut -d'=' -f1)

        #create a regex of all the possible matches we want to replace with $rep 
        targets=$(echo $line | cut -d'(' -f2- | cut -d')' -f1) 
        regex="($(echo $targets | sed -r 's/,/|/g'))"

        #do the replacements   
        output=$(echo $output | sed -r "s/${regex}/${rep}/g")
    done
    echo $output
}

#step through the input file, calling the above function on each line.
#assuming all lines are formatted like the example!
for line in $(cat input.txt)
do
    replace $line
done

output:

((speciesA:1,speciesB:1)100:1,speciesC:1)100;
((speciesA:1,speciesB:1)100:1,speciesC:1)100;

edited Mar 27 '15 at 15:53

answered Mar 27 '15 at 14:39

Sam


Any explanation for the code (even as code comments) would be nice – ryanyuyu Mar 27 '15 at 14:45
`for line in $(cat file)` is a bad idea - see http://mywiki.wooledge.org/DontReadLinesWithFor. You should use a `while read` loop, or better yet, use the right tool for the job, which in this case is awk. – Tom Fenech Mar 27 '15 at 15:37
It's fine if there are no spaces, @TomFenech. Besides, I assumed I was just doing someone's homework, not curing cancer. – Sam Mar 27 '15 at 15:45
It's up to you whether you take my advice or not, I'm just trying to help improve your answer. – Tom Fenech Mar 27 '15 at 15:47
My answer is terrible, but not because of my use of 'for'. – Sam Mar 27 '15 at 16:52
Thanks very much! Indeed, as for the whitespace, I don't have any so it's working perfectly! – tlorin Mar 30 '15 at 14:47
@tlorin heh, you should really listen to Tom. My answer is pretty terrible if your problem scales at all. I just whipped out a working solution hastily. Even though his answer looks confusing, it's much much better. – Sam Apr 07 '15 at 14:49
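For readers who still want a pure bash version, the `while read` loop suggested in the comments above might look roughly like this. It is a sketch under assumptions, not a replacement endorsed by either answerer: the temporary directory is added so the example is self-contained, variables are quoted, `cut` is swapped for parameter expansion, and `sed -E` is used in place of `sed -r`.

```shell
#!/bin/bash
# Sketch only: recreates the sample files from the answer in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'speciesA=(gene1,gene4)' 'speciesB=(gene2,gene5)' \
    'speciesC=(gene3,gene6)' > "$tmp/equivs.txt"
printf '%s\n' '((gene1:1,gene2:1)100:1,gene3:1)100;' \
    '((gene4:1,gene5:1)100:1,gene6:1)100;' > "$tmp/input.txt"

replace() {
    local output=$1 line rep targets regex
    # Read the mapping file line by line; IFS= and -r keep whitespace
    # and backslashes intact.
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        rep=${line%%=*}              # text before the first "="
        targets=${line#*\(}          # drop everything up to the first "("
        targets=${targets%%\)*}      # drop the ")" and anything after it
        regex="(${targets//,/|})"    # gene1,gene4 -> (gene1|gene4)
        output=$(printf '%s\n' "$output" | sed -E "s/${regex}/${rep}/g")
    done < "$tmp/equivs.txt"
    printf '%s\n' "$output"
}

# Step through the input file, calling replace on each line.
result=$(
    while IFS= read -r line; do
        replace "$line"
    done < "$tmp/input.txt"
)

printf '%s\n' "$result"
rm -rf "$tmp"
```

This keeps the structure of convert.sh, so it inherits its costs: one `sed` process per mapping per input line, and substring matching that would also rewrite `gene10` when `gene1` is in the list. That is why the awk approach remains the better fit for larger inputs.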
