There are a few threads that seem to ask the same question I'm interested in here, but some of the answers are tricky to generalise (or I'm not smart enough), e.g.:
how to replace strings in file based on values from another file? (example inside)
Replacing strings in file, using patterns from another file
I have some complicated files that look like this:
((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);
(These are Newick format phylogenetic trees in case anyone is interested)
I need to change all the ID keys (the bits that look like XXX_YYYYY) in this file and am not sure what the best approach would be.
They need to be replaced by the 'group' (operon) they belong to, so I was thinking that making an index file of sorts would be the way to go. For example, PLT_01696 gets replaced with group_1, say:
Keyfile:
PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
....
PAU_02074 group_2
So I think the best way to do this would be to pass the keyfile to sed (or some equivalent) and have it find each entry from column one and replace it with whatever I've paired it with in column two. The keyfile will have about 350 individual keys in the end, which will be sorted into around 12 groups.
And the file would end up looking like:
((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,group_1:0.08160284537473952438)98:0.04771898687201567291,.....
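To make the idea concrete, here is a rough sketch of the kind of thing I have in mind, not a tested solution. It assumes the keyfile has one whitespace-separated "ID group" pair per line and that every ID in the tree is immediately followed by a colon. The file names (keyfile.txt, tree.nwk, replace.sed, tree_grouped.nwk) are made up, and the sample tree is a toy one with shortened branch lengths.

```shell
#!/bin/sh
# Sketch only: all file names here are hypothetical, and the data is a toy
# example (shortened branch lengths), not the real tree.
set -eu

cat > keyfile.txt <<'EOF'
PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
PAU_02074 group_2
EOF

cat > tree.nwk <<'EOF'
((PLT_01736:0.068,(PLT_01758:0.048,PAU_02074:0.046)98:0.047)92:0.051,PLT_01696:0.015);
EOF

# Turn every "ID group" line of the keyfile into a sed command, e.g.
#   PLT_01696 group_1  ->  s/PLT_01696:/group_1:/g
# -n ... p skips any line that is not a two-column pair.
# The trailing colon keeps one ID from matching the start of a longer one.
sed -n 's|^\([^[:space:]]*\)[[:space:]]\{1,\}\([^[:space:]]*\)$|s/\1:/\2:/g|p' \
    keyfile.txt > replace.sed

# Apply all generated substitutions to the tree in a single pass.
sed -f replace.sed tree.nwk > tree_grouped.nwk
cat tree_grouped.nwk
```

With around 350 keys the generated replace.sed would simply have around 350 lines. IDs that are missing from the keyfile would be left untouched.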
I'm open to alternative suggestions; this just seemed the most obvious approach to me. This is on Ubuntu 14.04, so any solution is fair game really, but I'm much more au fait with bash (and a bit of perl).