
I'm new to this website. Here's a problem that has troubled me for more than two hours. I have a string (a phylogenetic tree in Newick format) that looks like:

((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35);

The tree may have multiple levels, indicated by parentheses. Now I want to add a number, say 10, to the top-level numbers (branch lengths). Here there are only three top-level numbers: 22, 76, 35. After the conversion the string should look like:

((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45);

I tried my best to come up with a proper regex but finally had to admit my limits. How can it actually be done?

Qiyun Zhu
  • I'd simply parse the string into an actual tree hierarchy of objects, do your work and serialize the tree back into a string. I don't know if that's feasible in your situation, but it seems like the logical way to do it. – Mattias Buelens Jul 31 '12 at 13:51
  • Have you considered [BioPerl](https://metacpan.org/release/BioPerl)? I believe it includes tools for parsing and manipulating your tree thingy. I don't know much about it, but a search for [newick](https://metacpan.org/search?q=newick) turned it up and I was already guessing something like that existed under that banner. – zostay Jul 31 '12 at 13:52
  • [Bio::PhyloNetwork](http://search.cpan.org/perldoc?Bio%3A%3APhyloNetwork) perhaps. – TLP Jul 31 '12 at 16:42
  • Thanks, guys, for recommending BioPerl. I have it installed on my own workstation, and it does have extensive tree functions. But for this particular task I would like something that also works on computers without BioPerl, for example an office Mac desktop. So a general solution without special libraries may be better. – Qiyun Zhu Aug 01 '12 at 20:56

4 Answers


Although I would opt for parsing the whole tree, the problem can be solved when using only regexes:

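use strict; use warnings; use feature qw(say);
my $string = "((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35)";
# Strip the outermost parentheses so the top-level elements start the string.
$string =~ s/^\(//;
$string =~ s/\)$//;
# $1 = pre-element, $2 = top-level number, $3 = trailing comma or end of string.
$string =~ s{
    \G ((?&PRELEM)) : (\d+) (,|$)
    (?(DEFINE)
        (?<SUBLIST> [(] (?&ELEM)(?:,(?&ELEM))* [)] )
        (?<ELEM> (?&PRELEM) : \d+ )
        (?<PRELEM> (?:[A-Z]|(?&SUBLIST)) )
    )
}{"$1:".($2+10).$3}gex;
# Put the outer parentheses back before printing.
say "($string)";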

Prints ((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45).

I define a small grammar for top-down recursive parsing; please adapt it as needed. At the top level we have uninteresting pre-elements, which we store in $1. They can be a single letter or a tree enclosed in parentheses. After a : comes the number we want to increment, stored in $2. It is followed by a comma or the end of the string, stored in $3. We match iteratively, each match starting where the last one left off (that is what the /g option and the \G assertion do). The addition happens when we build the substitution string, which is evaluated as code because of the /e option.
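To see what the \G assertion contributes, here is a minimal sketch on a made-up string (the input and the variable names are illustrative assumptions, not part of the tree problem). The two substitutions differ only in whether \G is present:

```perl
use strict;
use warnings;

my $input = '1,2,3;4,5';

# With \G each match must begin exactly where the previous one ended,
# so the run stops at "3;" and never reaches "4,".
(my $with_anchor = $input) =~ s/\G(\d+),/($1 + 10) . ','/ge;

# Without \G the engine is free to skip ahead, so "4," is rewritten too.
(my $without_anchor = $input) =~ s/(\d+),/($1 + 10) . ','/ge;

print "$with_anchor\n";     # 11,12,3;4,5
print "$without_anchor\n";  # 11,12,3;14,5
```

In the tree regex the same anchoring is what keeps the substitution from skipping into a nested subtree and incrementing a number there.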

amon
s/(?:^\(|(\((?:(?>[^()]*)|(?1))*\)))\K|:\K([0-9]+)/defined $2 ? $2+10 : ""/ge

Match either things you want to skip or digits preceded by a :.

Things you want to skip are either the leading ( or any balanced set of parentheses (balanced parentheses regex taken almost literally from perlre).

In the substitution, add ten if digits to be modified were matched; otherwise substitute the empty string.

But you are better off not being clever and instead going to the work to parse, modify, and reserialize your tree.
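Going the parse-and-modify route does not necessarily require a full tree structure or extra modules. As a minimal sketch in core Perl, one can tokenize the string and track the parenthesis depth, changing only numbers seen at depth 1. The sub name `add_to_top_level` is hypothetical, and the sketch assumes branch lengths are non-negative integers and that labels contain no parentheses or colons (quoted Newick labels are not handled):

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answers in this thread): add $delta to
# every branch length that sits directly inside the outermost parentheses.
sub add_to_top_level {
    my ($newick, $delta) = @_;
    my $depth = 0;
    my $out   = '';
    # Tokenize into "(", ")", ":digits", runs of other characters, or a lone ":".
    while ($newick =~ /\G( \( | \) | :\d+ | [^():]+ | : )/xgc) {
        my $tok = $1;
        if    ($tok eq '(') { $depth++ }
        elsif ($tok eq ')') { $depth-- }
        elsif ($depth == 1 && $tok =~ /^:(\d+)$/) {
            # Assumption: integer branch lengths only.
            $tok = ':' . ($1 + $delta);
        }
        $out .= $tok;
    }
    return $out;
}

print add_to_top_level('((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35);', 10), "\n";
# ((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45);
```

Because the depth counter replaces the recursive pattern, the same loop could be pointed at another level by changing the `$depth == 1` test.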

ysth
  • Thanks, ysth! I spent another hour learning the regex you wrote. I learned a lot, including these extended patterns, recursive matching, the \K escape, etc. – Qiyun Zhu Aug 01 '12 at 20:40

This needs a recursive regular expression to match the nested parentheses.

First define a 'key', which is either a string of capital letters or any number of key:value pairs between parentheses.

Then find all keys followed by a colon and a decimal number and do the arithmetic on the number.

use strict;
use warnings;

my $str = '((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35)';

my $key = qr/ (?<key> [A-Z]+ | \( (?&key) : \d+ (?: , (?&key) : \d+ )* \)  ) /x;

$str =~ s/$key : \K ( \d+ ) /$2 + 10/xge;

print $str;

output

((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45)
Borodin
  • How did you learn and practice these new features like `?&name`, `\k\K` syntax etc. From perlre? I'm trying to find some good reference material to keep up with these new fancy things. – rubber boots Jul 31 '12 at 17:49
  • It would be really awesome if there were a regex for "pattern repeated with separator". Just scratching, something like `(?:(?&key):\d+){,\s*}` It would also make package expressions easier: `(?:\p{Upper}\w*){::}` /musing – Axeman Jul 31 '12 at 18:15
  • That's a great solution, and written in an easy-to-read form. I found perlre is one of the only places where I can learn these advanced regex techniques. – Qiyun Zhu Aug 01 '12 at 20:42

First, I'd like to thank ysth for his very interesting posting in this thread. From it I learned how and why to apply the \K (keep) escape.

I added another \K (to the first subexpression) and used the possessive quantifier ++ in place of the atomic group (?>...):

my $r = qr{
  (?:
     (?: ^ \(\K )
     |
     (
       \( (?: [^()]++ | (?1) )* \)
     )\K
  )
  |
  :\K (\d+)
}x;

The output string now matches exactly the input string - except for the incremented values:

$t =~ s/$r/defined $2 ? $2+10 : ''/ge;

input:  ((A:14,B:43):22,C:76,(D:54,(E:87,F:28):17):35)
output: ((A:14,B:43):32,C:86,(D:54,(E:87,F:28):17):45)
rubber boots
  • oops, I had that \K in the wrong place, sorry for any confusion. \K is a very handy feature. – ysth Aug 01 '12 at 05:32
  • Thanks for the code! I didn't know about ?: before and wrote recursive bracket matching with plain (). Your code explains that well to me. ysth, you edited the \K position, didn't you? I modified your code a little and it fits my real application well now. Seems to be no problem at all. – Qiyun Zhu Aug 01 '12 at 20:48