So basically I have a string:

string_1 = '(((A,B)123,C)456,(D,E)789)135'

It contains a phylogenetic tree with bootstrap values in parenthetical (Newick) notation (not really important to the question, but in case anyone was wondering). This example tree contains four relationships with four bootstrap values (the numbers following each closing parenthesis). I have each of these relationships in a list of lists:

list_1 = [['(A,B)', 321], ['((A,B),C)', 654],
          ['(D,E)', 987], ['(((A,B),C),(D,E))', 531]]

each containing a relationship and its updated bootstrap value. All I need to do is to create a final string:

final = '(((A,B)321,C)654,(D,E)987)531'

where all the bootstrap values are updated to the values in list_1. I have a function to remove bootstrap values:

import re

def remove_bootstrap(string):
   # Drop any integer or decimal number that directly follows a ')'
   return re.sub(r'(?<=\))\d+\.?\d*', '', string)

and code to isolate relationships:

list_of_bipart_relationships = []
for bipart_file in list_bipart_files:
   # Read the whole file, then close it again
   with open(bipart_file) as open_file:
      read_file = open_file.read()
   length = len(read_file)
   for index in range(1, length):
      if read_file[index] == '(':
         # Scan forward for the parenthesis that closes this one
         parenthesis_count = 1
         for sub_index in range(index + 1, length):
            if read_file[sub_index] == '(':
               parenthesis_count += 1
            elif read_file[sub_index] == ')':
               parenthesis_count -= 1
            if parenthesis_count == 0:
               bad_relationship = read_file[index:sub_index + 1]
               # remove_length and extract are my own helpers (not shown here)
               relationship_without_values = remove_length(bad_relationship)
               bootstrap_value = extract(sub_index, length, read_file)
               pair = [bootstrap_value, relationship_without_values]
               list_of_bipart_relationships.insert(0, pair)
               break

and I am completely at a loss. I cannot figure out how to get the program to recognize a larger relationship once a nested relationship's bootstrap value is updated. Any help would be greatly appreciated!

xbello
Andrew WM

1 Answer

This is a solution using Biopython. First you need to load your trees. If you're using strings, you'll need to wrap them in a StringIO first, as the Parser only accepts file handles:

from io import StringIO
from Bio.Phylo.NewickIO import Parser

string_1 = u'(((A,B)123,C)456,(D,E)789)135'                        
handle = StringIO(string_1)

tree = list(Parser(handle).parse())[0]  # Assuming one tree per string

Now that you have the tree loaded, let's find the clades and update some values. This should be refactored into a function that accepts a list of clade names and returns a list of clades to pass to common_ancestor, but for illustration:

clade_A = list(tree.find_clades(target="A"))[0]
clade_B = list(tree.find_clades(target="B"))[0]

tree.common_ancestor(clade_A, clade_B).confidence = 321
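
The refactoring mentioned above might look roughly like the sketch below. The helper name update_confidences is made up for illustration, and it assumes that leaf names are unique, that they contain no parentheses, commas or whitespace, and that every relationship in list_1 really is a clade of the tree. It works on its own example_tree, so the tree object used in the rest of this answer is left untouched:

```python
import re
from io import StringIO

from Bio import Phylo


def update_confidences(tree, relationships):
    """Hypothetical helper: set the confidence of each listed clade."""
    # Map every leaf name to its terminal clade (assumes unique names).
    leaves = {leaf.name: leaf for leaf in tree.get_terminals()}
    for relationship, confidence in relationships:
        # Leaf names are whatever sits between parentheses and commas.
        names = re.findall(r"[^(),\s]+", relationship)
        # The clade for a relationship is the common ancestor of its leaves.
        clade = tree.common_ancestor([leaves[name] for name in names])
        clade.confidence = confidence
    return tree


list_1 = [['(A,B)', 321], ['((A,B),C)', 654],
          ['(D,E)', 987], ['(((A,B),C),(D,E))', 531]]

example_tree = Phylo.read(StringIO('(((A,B)123,C)456,(D,E)789)135'), 'newick')
update_confidences(example_tree, list_1)
print(example_tree.format('newick'))
```

Because the clades are looked up by their leaves rather than by matching substrings, updating a nested clade does not affect how the enclosing clade is found.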

Now print the tree in Newick format:

print(tree.format("newick"))

# Outputs
# (((A:1.00000,B:1.00000)321.00:1.00000,C:1.00000)456.00:1.00000,(D:1.00000,E:1.00000)789.00:1.00000)135.00:1.00000;

Note that the confidence value for (A, B) is now 321 instead of 123.
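
If you need exactly the compact string from the question (no branch lengths, integer values), a plain-Python pass over the string is another option. This is only a sketch under assumptions: update_bootstraps is a made-up name, the input has no branch lengths, and the keys in list_1 are the relationships with their bootstrap values stripped, as in the question. The lookup key for each clade is always cut from the original string, so values already rewritten for nested clades never get in the way:

```python
import re

# A number (integer or decimal) that directly follows a ')'
SUPPORT_AFTER_PAREN = re.compile(r"(?<=\))\d+(?:\.\d+)?")
LEADING_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def update_bootstraps(newick, new_values):
    """Hypothetical helper: rewrite the value after each ')' from new_values."""
    lookup = {relationship: value for relationship, value in new_values}
    out = []
    open_positions = []  # indices of '(' that are still unmatched
    i = 0
    while i < len(newick):
        char = newick[i]
        out.append(char)
        i += 1
        if char == "(":
            open_positions.append(i - 1)
        elif char == ")":
            start = open_positions.pop()
            # Key: this clade as written in the ORIGINAL string, values removed.
            key = SUPPORT_AFTER_PAREN.sub("", newick[start:i])
            old = LEADING_NUMBER.match(newick, i)
            old_text = old.group() if old else ""
            # Keep the old value when the clade is not listed in new_values.
            out.append(str(lookup.get(key, old_text)))
            i += len(old_text)  # skip the old value in the input
    return "".join(out)


string_1 = '(((A,B)123,C)456,(D,E)789)135'
list_1 = [['(A,B)', 321], ['((A,B),C)', 654],
          ['(D,E)', 987], ['(((A,B),C),(D,E))', 531]]

final = update_bootstraps(string_1, list_1)
print(final)  # (((A,B)321,C)654,(D,E)987)531
```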

xbello