
I have a list in python consisting of one item which is a tree written in Newick Format, as below:

['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']

In tree format this appears as below:

[Image: the rendered tree, with the near-zero-length branches highlighted in red]

I am trying to write some code that will look through the list item and return the IDs (BMNHxxxxxx) that are joined by a branch length of 0, or below some small cutoff such as 0.001 (highlighted in red). I thought about using a regex such as:

JustTree = []
with JustTree as f:
    for match in re.finditer(r"(?<=Item\sA)(?:(?!Item\sB).){50,}", subject, re.I):
        f.extend(match.group()+"\n") 

This is taken from another StackOverflow answer. Item A would be a ':', as the branch lengths always appear after a ':', and item B would be a ',', a ')' or a ';', as these are the three characters that delimit a branch length. However, I'm not experienced enough with regex to do this.

With a branch length of 0 in this case, I want the code to output ['BMNH703458a', 'BMNH703458b']. If I could alter this to also include IDs joined by a branch length below a user-defined value of, say, 0.01, that would be highly useful.

If anyone has any input, or can point me to a useful answer I would highly appreciate it.

PaulBarr
  • But to me, those two IDs are not joined by branch length of `0`, but by branch length of `0.00000328449424529074`. Is there a certain degree of precision you consider to be insignificant? – Jerry Apr 19 '14 at 16:31
  • @Jerry Apologies, I'll edit my question. Yes, I made the assumption that 0.00000328449424529074 was not significantly different from 0 – PaulBarr Apr 19 '14 at 16:43
  • Whilst this works in this particular example, it doesn't work in all examples. To explain Newick format: say we had a tree with three species, A, B and C, where A and B were more closely related to each other than either is to C. In the tree I uploaded, look at the three species directly above the red highlighted branches to see what I mean. In Newick format this would be written as ((A,B),C). To include branch lengths you add the length after a ':'. So while your example works in this case, you can see that by increasing the variable 0\.000 it will start to put together IDs that aren't closely related. – PaulBarr Apr 19 '14 at 17:02
  • The way I was thinking of approaching this problem was to extract all the branch lengths into a list (BranchLst1), add those below a user-defined value to another list (SmallBranchLst2), and then match these branches back to the original Newick tree list. I can do the second part, but I can't write a regex that extracts all the branch lengths from the Newick tree list and puts them into another list – PaulBarr Apr 19 '14 at 17:05
  • Thank you, I appreciate your help! Yes, related IDs will be identical. If it makes it easier we can ignore the a or b, as I added these to help myself identify the IDs that should go together. This would mean all IDs in the tree are identical in form: they start with BMNH and end with 6 numbers – PaulBarr Apr 19 '14 at 17:09
  • [Here's one for matching IDs except the last character](http://regex101.com/r/jA8vA5) and [a different one to extract only numbers](http://regex101.com/r/tB8yX1) – Jerry Apr 19 '14 at 17:09
  • The one used to extract only the numbers is perfect! That is exactly what I was looking for, thank you! Could you include in your answer the best way of getting these numbers from the original list containing the Newick tree into a new list where each item is a branch length? – PaulBarr Apr 19 '14 at 17:13

3 Answers


Okay, here's a regex to extract only numbers (with potential decimals):

\b[0-9]+(?:\.[0-9]+)?\b

Each \b is a word boundary. The two of them make sure that no digit, letter or underscore sits immediately next to the matched number, so the digits inside IDs such as BMNH833953 are not matched.

[0-9]+ matches multiple digits.

(?:\.[0-9]+)? is an optional group, meaning that it may or may not match. If there is a dot and digits after the first [0-9]+, then it will match those. Otherwise, it won't. The group itself matches a dot, and at least 1 digit.

You can use it with re.findall to put all the matches in a list:

import re
NewickTree = ['(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;']

pattern = re.compile(r"\b[0-9]+(?:\.[0-9]+)?\b")

for tree in NewickTree:
    branch_lengths = pattern.findall(tree)
    # Do stuff to the list branch_lengths
    print(branch_lengths)

For this list, you get this printed:

['0.16529463651919140688', '0.22945757727367316336', '0.18028180766761139897',
 '0.21469677818346077913', '0.54350916483644962085', '0.00654573856803835914', 
 '0.04530853441176059537', '0.02416511342888815264', '0.21236619242575086042',
 '0.13421900772403019819', '0.14957653992840658219', '0.02592135486124686958', 
 '0.02477670174791116522', '0.22983459269245612444', '0.00000328449424529074',
 '0.29776257618061197086', '0.09881729077887969892', '0.02257522897558370684',
 '0.21599133163597591945', '0.02365043128986757739', '0.16069861523756587274',
 '0.0']
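
To get from these strings to the short branches the question asks about, the values can be converted to floats and compared with a cutoff. This is a minimal sketch, assuming a user-chosen cutoff; the names `short_branch_lengths` and `threshold` and the small example tree are illustrative and not from the question:

```python
import re

# Same pattern as above: a number with an optional decimal part.
BRANCH_LENGTH = re.compile(r"\b[0-9]+(?:\.[0-9]+)?\b")

def short_branch_lengths(newick, threshold):
    """Return every branch length in `newick` that is below `threshold`."""
    lengths = [float(x) for x in BRANCH_LENGTH.findall(newick)]
    return [x for x in lengths if x < threshold]

# Hypothetical small tree, used only to keep the example readable.
example_tree = "((A:0.2,B:0.3):0.000003,C:0.1):0.0;"
print(short_branch_lengths(example_tree, 0.001))
```

This only yields the lengths. Tying each length back to the IDs around it needs the tree structure, which a flat list of numbers no longer carries.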
Jerry
  • Regex could be simplified to `r"\b[\d.]+"`. You could translate the strings to floats: `branch_lengths = [float(x) for x in branch_lengths]` – ooga Apr 19 '14 at 18:24
  • Simplifying doesn't always mean better. That regex will also match a lot of dots, and any other characters that `\d` can match besides English numbers. – Jerry Apr 19 '14 at 18:32
  • Just trying to be helpful, Jerry. :) – ooga Apr 19 '14 at 18:35
  • And I'm just saying why that won't necessarily work better :) – Jerry Apr 19 '14 at 18:38
  • Converting to floats is a good idea though, but I'm sure the OP could figure that out for himself. :) See if you can better-implement my nested-list routine below. I feel it's pythonically suboptimal. – ooga Apr 19 '14 at 18:42
  • Well, first, I never said there was anything wrong with the translation to float, and second, I believe the OP should be able to do that. It's up to you if you want to spoonfeed everything. – Jerry Apr 19 '14 at 18:46

I know your question has been answered, but if you ever want your data as a nested list instead of a flat string:

import re
import pprint

a="(BMNH833953:0.16529463651919140688,(((BMNH833883:0.22945757727367316336,(BMNH724182a:0.18028180766761139897,(BMNH724182b:0.21469677818346077913,BMNH724082:0.54350916483644962085):0.00654573856803835914):0.04530853441176059537):0.02416511342888815264,(((BMNH794142:0.21236619242575086042,(BMNH743008:0.13421900772403019819,BMNH724591:0.14957653992840658219):0.02592135486124686958):0.02477670174791116522,BMNH703458a:0.22983459269245612444):0.00000328449424529074,BMNH703458b:0.29776257618061197086):0.09881729077887969892):0.02257522897558370684,BMNH833928:0.21599133163597591945):0.02365043128986757739,BMNH724053:0.16069861523756587274):0.0;"

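def tokenize(newick):
  # Yield "(", ")" and "name:length" / ":length" tokens; "," and ";" are skipped.
  for m in re.finditer(r"\(|\)|[\w.:]+", newick):
    yield m.group()

def make_nested_list(tok, L=None):
  # A leaf becomes [name, length]; the length of a parenthesised group is
  # appended as a bare float right after that group's list.
  if L is None: L = []
  while True:
    try: t = next(tok)  # next(tok) works on Python 2.6+ and 3; tok.next() is Python 2 only
    except StopIteration: break
    if   t == "(": L.append(make_nested_list(tok))
    elif t == ")": break
    else:
      i = t.find(":"); assert i != -1
      if i == 0: L.append(float(t[1:]))
      else:      L.append([t[:i], float(t[i+1:])])
  return L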

L = make_nested_list(tokenize(a))
pprint.pprint(L)
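
In the resulting structure each leaf is a `[name, length]` pair, and the length of a parenthesised group appears as a bare float immediately after that group's list. As one possible use, the sketch below walks such a nested list and reports the leaf names under every group whose own branch is shorter than a cutoff. The helper names and the small hand-written example list are assumptions for illustration, and reporting all leaves under the short branch is only one reading of which IDs are "joined by" it:

```python
def is_leaf(item):
    # A leaf is stored as [name, branch_length].
    return isinstance(item, list) and len(item) == 2 and isinstance(item[0], str)

def leaf_names(clade):
    """Collect the names of all leaves below a nested-list clade."""
    names = []
    for item in clade:
        if is_leaf(item):
            names.append(item[0])
        elif isinstance(item, list):
            names.extend(leaf_names(item))
    return names

def clades_on_short_branches(clade, threshold):
    """Return (length, leaf names) for each group whose branch is < threshold."""
    found = []
    for i, item in enumerate(clade):
        if isinstance(item, list) and not is_leaf(item):
            # The group's own length, if present, is the float that follows it.
            following = clade[i + 1] if i + 1 < len(clade) else None
            if isinstance(following, float) and following < threshold:
                found.append((following, leaf_names(item)))
            found.extend(clades_on_short_branches(item, threshold))
    return found

# Hand-written nested list in the shape make_nested_list produces for
# the hypothetical tree "(A:0.2,(B:0.3,C:0.1):0.000003,D:0.4):0.5;".
nested = [[["A", 0.2], [["B", 0.3], ["C", 0.1]], 0.000003, ["D", 0.4]], 0.5]
print(clades_on_short_branches(nested, 0.001))
```

Raising the cutoff above 0.5 in this example would also report the outer group, with all four leaves.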
ooga

There are several Python libraries that support the Newick format. The ETE toolkit lets you read Newick strings and work with trees as Python objects:

from ete3 import Tree  # ete2 was the Python 2 release; ete3 is its Python 3 successor
tree = Tree(newickFile)  # newickFile: a path to a Newick file, or the Newick string itself
print(tree)

Several Newick subformats can be chosen, and branch distances are parsed even when they are written in scientific notation:

from ete3 import Tree
tree = Tree("(A:3.4, (B:0.15E-10,C:0.0001):1.5E-234);")
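
If installing a library is not an option, scientific notation can also be handled with the standard library alone. This is a rough sketch, assuming every branch length directly follows a ':' as it does in the trees above; the pattern and the function name `branch_lengths` are illustrative, and this is not how ETE parses the string:

```python
import re

# A ':' followed by digits, an optional decimal part and an optional exponent.
LENGTH_AFTER_COLON = re.compile(r":([0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)")

def branch_lengths(newick):
    """Return all branch lengths in `newick` as floats, in order of appearance."""
    return [float(m) for m in LENGTH_AFTER_COLON.findall(newick)]

print(branch_lengths("(A:3.4, (B:0.15E-10,C:0.0001):1.5E-234);"))
```

Anchoring on the ':' also keeps digits that belong to a name out of the result, at the cost of missing any number that is not written directly after a colon.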
jhc