replace strings in file from a reference list

Question

There are a few threads that seem to ask the same question I have here, but some of the answers are tricky to generalise (or I'm not smart enough), e.g.:

how to replace strings in file based on values from another file? (example inside)

Replacing strings in file, using patterns from another file

I have some complicated files that look like this:

 ((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);

(These are Newick format phylogenetic trees in case anyone is interested)

I need to change all the ID keys (the bits that look like XXX_YYYYY) in this file and am not sure what the best approach would be.

They need to be replaced by the 'group' (operon) they belong to, and so I was thinking that making an index file of sorts would be the way to go, so for example, PLT_01696 gets replaced with group_1 say:

Keyfile:

PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
....
PAU_02074 group_2

So I think if I could pass a file to sed or some equivalent, and get it to look for each entry in column one and replace it with whatever I've paired it with in column two, that would be the best way to do this? This file will have about 350 individual keys in the end, which will be sorted into around 12 groups.

And the file would end up looking like:

((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,group_1:0.08160284537473952438)98:0.04771898687201567291,.....

I'm open to alternative suggestions, this just seemed most apparent to me. This is on Ubuntu 14.04 so any solution is fair game really, but I'm much more au fait with bash (and a bit of perl).

To add to 123's comment: "Questions asking for homework help must include a summary of the work you've done so far to solve the problem, and a description of the difficulty you are having solving it." http://stackoverflow.com/help/on-topic — Mort, May 09 '16 at 13:40
`So I think if I could pass a file to sed or some equivalent` : In fact awk would be good, perl would be better :) — sjsam, May 09 '16 at 13:47
Yeah the complexity of the tree files means I'm at a bit of a loss - and there's nothing that I can do about it :/ ... @123, haven't tried anything directly yet because I'm not even sure what approach is most likely to work best. Currently my alternative is to go through and do them all by hand! — Joe Healey, May 09 '16 at 13:47
The first question you link to is really almost exactly the same as yours (minus a sed global flag, I'd say). Did you try those solutions? Where did you encounter problems? — Benjamin W., May 09 '16 at 13:49
There are multiple awk solutions for creating associative arrays from one file to use in another. — 123, May 09 '16 at 13:50

Jonathan Leffler · Answer 1 · 2016-05-09T14:01:00.540

One solution in such cases is to write a sed script that writes the sed script you want to execute. It appears that operons are preceded by either ( or , and are always followed by :. So, given your file containing mappings such as:

PLT_01736 group_1

then for each line in that file you want to create a sed operation that looks like:

s/\([,(]\)PLT_01736:/\1group_1:/g

where the g might not be necessary (I don't know if a given operon can appear more than once in a single line). The initial character class captures the ( or , and the \( and \) remember that, and it's followed by the specific ID key, and the colon; the replace operation outputs the remembered character, the replacement text and the colon. The advantage of tracking the preceding and following characters is that if by some mischance you have operons PLT_00100 and PLT_001001 (where one operon is a prefix of the other), tracking the surrounding characters ensures the correct match. Otherwise, you have to ensure that the longest matches appear first in the script, which is fiddlier (sort -r probably sorts that out, but …).
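The prefix problem described above can be seen in a small sketch. The two IDs and the tiny tree are hypothetical, chosen only so that one key is a prefix of the other; they are not taken from the question's data.

```shell
#!/bin/sh
# Hypothetical tree: PLT_00100 is a prefix of PLT_001001.
tree='(PLT_00100:0.1,PLT_001001:0.2);'

# Unanchored: the shorter key also matches inside the longer one.
unanchored=$(printf '%s\n' "$tree" | sed 's/PLT_00100/group_1/g')

# Anchored on the preceding ( or , and on the following colon.
anchored=$(printf '%s\n' "$tree" | sed 's/\([,(]\)PLT_00100:/\1group_1:/g')

printf '%s\n' "$unanchored" "$anchored"
```

The unanchored command turns PLT_001001 into group_11, while the anchored one leaves it alone because the key is not followed by a colon there.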

Hence, assuming the mappings are in a file mapping.data, you can use:

sed 's%\([A-Z]*_[0-9]*\)  *\(.*\)%s/\\([,(]\\)\1:/\\1\2:/g%' mapping.data > script.sed
sed -f script.sed newick.phylogenetic.tree.data > transformed.data

This uses % in the generating s%%% operation, outputting s/// (it requires some care). The search part of the s%%% looks for zero or more upper-case letters, an underscore, and zero or more digits, capturing that with the \( and \); followed by one or more spaces, followed by some other characters which are also captured. If the ID keys can have a different structure, then change the matching regex appropriately. I assume that the input data is 'clean', so there's no need to restrict processing to lines with exactly three letters, an underscore and exactly five digits, and that there are no trailing blanks. With the two parts (key ID and replacement) isolated, it is just necessary to generate the output s/// command, remembering to double up the backslashes that must appear in the output.
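To see what the generator actually emits, here is a minimal sketch that runs it on a two-line mapping and prints the resulting script before applying it. The temporary directory, the shortened tree and the variable names are assumptions made for the illustration.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

# Two sample mappings in the "ID group" layout of the key file.
printf '%s\n' 'PLT_01736 group_1' 'PAU_02074 group_2' > "$tmp/mapping.data"

# A shortened, made-up tree that contains both IDs.
printf '%s\n' '(PLT_01736:0.1,PAU_02074:0.2);' > "$tmp/tree.data"

# Step 1: generate one s/// command per mapping line.
sed 's%\([A-Z]*_[0-9]*\)  *\(.*\)%s/\\([,(]\\)\1:/\\1\2:/g%' "$tmp/mapping.data" > "$tmp/script.sed"
generated=$(cat "$tmp/script.sed")
printf '%s\n' "$generated"

# Step 2: apply the generated script to the tree.
result=$(sed -f "$tmp/script.sed" "$tmp/tree.data")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Inspecting the generated script before running it is a cheap way to check that the backslashes survived the doubling.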

Given your input data and list of keys, the output I get is:

((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((group_2:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,group_1:0.01562657531699716829);

score 2 · Accepted Answer · answered May 09 '16 at 13:50


I'll bite. Let's call the script phylo.awk:

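# First file (the key file): remember each ID and its replacement group
NR==FNR { pattern[NR] = $1; replacement[NR] = $2; count++; next }
# Second file (the tree): apply every mapping to the current line
{
    for (i = 1; i <= count; i++) {
        # gsub replaces every occurrence on the line; sub would stop at the first
        gsub(pattern[i], replacement[i])
    }
    print $0
}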

Then say:

awk -f phylo.awk patterns data
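A possible variant, sketched here rather than taken from the answer, stores the mappings in an associative array keyed by ID and looks up each whole ID found in the tree. This avoids partial matches when one ID is a prefix of another. The ID pattern (letters, underscore, digits), the sample files and the variable names are assumptions.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Hypothetical key file and tree; PLT_001001 is deliberately unmapped.
printf '%s\n' 'PLT_00100 group_1' 'PAU_02074 group_2' > "$tmp/patterns"
printf '%s\n' '(PLT_00100:0.1,PLT_001001:0.2,PAU_02074:0.3);' > "$tmp/data"

relabelled=$(awk '
# First file: build the ID -> group lookup table.
NR == FNR { group[$1] = $2; next }
# Second file: walk the line, swapping each complete ID that has a mapping.
{
    out = ""
    rest = $0
    while (match(rest, /[A-Za-z]+_[0-9]+/)) {
        id = substr(rest, RSTART, RLENGTH)
        # Copy the text before the ID, then the group (or the ID itself if unmapped).
        out = out substr(rest, 1, RSTART - 1) ((id in group) ? group[id] : id)
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
}' "$tmp/patterns" "$tmp/data")

printf '%s\n' "$relabelled"
rm -rf "$tmp"
```

Because each ID is looked up once, the work per line no longer grows with the number of keys, which may matter little for 350 keys but keeps the matching exact.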


Michael Vehrs


Yet again my ignorance to awk is my downfall :P it always seems to have a solution where bash struggles. If this does the job that's very elegant indeed! – Joe Healey May 09 '16 at 13:59
Though it is not an obligation, it is always good to give a reason for the down-vote. +1d it. :). As I guessed the awk solution is neater here. – sjsam May 09 '16 at 14:01
This worked fantastically (and simple enough for my feeble brain to understand :P ) thanks very much! – Joe Healey May 09 '16 at 14:31

score 0 · Answer 3 · answered May 09 '16 at 14:35

#!/bin/bash

# Loop over each "key replacement" line of the reference file
while read -r i; do

    a=$(echo "$i" | cut -d" " -f1)  # text to find
    b=$(echo "$i" | cut -d" " -f2)  # text to replace it with

    sed -i "s/$a/$b/g" input.txt    # find and replace; -i edits input.txt in place

done < ref.txt

input:

((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);

ref:

PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
PAU_02074 group_2

output:

((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((group_2:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,group_1:0.01562657531699716829);

This runs the `cut` command twice and the `sed` command once for each mapping to be done. That is not efficient compared with reading the mapping file once and the data file once — which other answers achieve. — Jonathan Leffler, May 09 '16 at 15:02
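One way to act on that comment while keeping the loop style is to let read split the two columns and to collect all substitutions into a single sed script, so sed runs once. This is only a sketch: the scratch files, the shortened tree and the variable names are assumptions, and the anchoring on ( or , and : is borrowed from the first answer.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Hypothetical reference list and shortened tree.
printf '%s\n' 'PLT_01736 group_1' 'PAU_02074 group_2' > "$tmp/ref.txt"
printf '%s\n' '(PLT_01736:0.1,PAU_02074:0.2,PLT_01716:0.3);' > "$tmp/input.txt"

# Build one sed script with a substitution per mapping; read splits the columns.
script=''
while read -r id group; do
    [ -n "$id" ] || continue    # skip blank lines
    script="${script}s/\\([,(]\\)${id}:/\\1${group}:/g
"
done < "$tmp/ref.txt"

# Single pass over the data file with all substitutions at once.
result=$(sed "$script" "$tmp/input.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Unlike sed -i, this writes the result to standard output, so the original file stays untouched until the output has been checked.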
