I have established my gene clusters and already calculated the distances needed to measure their phylogenetic relationship. I used an algorithm that gives a measure of distance between gene clusters; the result is stored in a data frame such as (input example):

BGC1      BGC2     Distance
------------------------------ 
BGC31     BGC34     0.6
BGC34     BGC45     0.7
BGC34     BGC53     0.2
BGC53     BGC31     0.8

x <- data.frame(BGC1 = c('BGC31','BGC34','BGC34','BGC35'), 
                BGC2 = c('BGC34','BGC45','BGC53','BGC51'), 
                distance = c(0.6,0.7,0.2,0.8))

Goal: Is it possible to construct a tree from this type of data alone? I would also like to have a .newick file for it, but I'm not sure whether this is possible in R.

I have been able to create network visualizations from this data in Cytoscape, but not a tree. Any further suggestions for this particular example?

Thanks once again for your input :)

  • My R is weak, but: I did this a while ago using Python's BioPython module `Bio.Phylo.TreeConstruction` with `DistanceTreeConstructor` and `DistanceMatrix`. Wrangle your distances into the correct format for `DistanceMatrix`, convert it into a tree and draw the tree with upgma/nj. – Pallie Feb 12 '20 at 13:33
  • I can also try Python; I just had a preference for R in this case. When you say to wrangle the distances into the correct format, what does that imply? Sorry for my ignorance on this – bioinformatics_student Feb 12 '20 at 14:24
  • From https://biopython.org/DIST/docs/api/Bio.Phylo.TreeConstruction.DistanceMatrix-class.html : The distance matrix constructor takes names and matrix as arguments. The names are just a flat list of your gene names. The matrix is a lower triangular format distance matrix of all genes vs all genes. – Pallie Feb 12 '20 at 14:38
  • @Pallie is it possible to use as the input for this, the matrix that I have in the example above? Currently my table of interest consists of these three columns. – bioinformatics_student Feb 12 '20 at 15:51

1 Answer

Following the suggestion in a comment by user20650 here, you can wrap the distances into a dist object using the lower.tri function. However, the provided example will not work as it stands, because it does not provide a distance for every pair of samples (six clusters need 15 pairwise distances, but only four are given). The solution below therefore takes your sample names, generates random distances as placeholders, and then constructs the tree with the nj function from the ape package.

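# get all sample names (as.character works whether or not the columns are factors;
# levels() returns NULL for plain character columns, the default since R 4.0)
x.names = sort(unique(c(as.character(x[, 1]), as.character(x[, 2]))))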
n = length(x.names)

# create all combinations for samples for pairwise comparisons
x2 = data.frame(t(combn(x.names, m = 2)))
# generate random distances
set.seed(4653)
x2$distance = sample(seq(from = 0.1, to = 1, by = 0.05), size = nrow(x2), replace = TRUE)

# prepare a matrix for pairwise distances
dst = matrix(NA, ncol = n, nrow = n, dimnames = list(x.names, x.names))
# fill the lower triangle with the distances obtained elsewhere
dst[lower.tri(dst)] = x2$distance

# construct a phylogenetic tree with the neighbour-joining method
library(ape)
tr = nj(dst)
plot(tr)

The Newick format of the tree can be saved with the ape::write.tree function or printed to the console as:

cat(write.tree(tr))
# (BGC53:0.196875,BGC45:0.153125,(((BGC35:0.025,BGC51:0.275):0.1583333333,BGC31:0.2416666667):0.240625,BGC34:0.246875):0.003125);
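
To get an actual .newick file rather than console output, write.tree also accepts a file argument. The following is a minimal sketch: the tree `tr_demo` and the output path are stand-ins chosen so the snippet runs on its own, and in practice you would pass the `tr` object built above.

```r
library(ape)

# Small stand-in tree so the snippet is self-contained; use `tr` from above in practice.
tr_demo <- read.tree(text = "((A:0.1,B:0.2):0.05,C:0.3,D:0.4);")

# Hypothetical output path; any writable location works.
out_file <- file.path(tempdir(), "bgc_tree.newick")
write.tree(tr_demo, file = out_file)

# Read the file back to confirm that it parses and the tip labels survived.
tr_back <- read.tree(out_file)
print(tr_back$tip.label)
```

Reading the file back is optional, but it is a cheap check that the Newick string is well formed before handing it to another tool.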
nya
  • Thanks for the reply, your post is quite helpful for orientation and I think I can adapt. In this case a distance matrix is created by calculating the distance between every pair of BGC in the data set, basically a **pairwise distance calculation** was done for all BGCs. I believe that the example that I provided was not a good one. – bioinformatics_student Feb 13 '20 at 10:07
  • do you think it would be possible to use this same method above considering that I do have a pairwise distance calculation set in place? – bioinformatics_student Feb 13 '20 at 10:07
  • It depends on how your comparisons are ordered in the vector. Use the empty matrix setup and then fill the `lower.tri` with your vector. Check whether the values are correctly assigned! – nya Feb 13 '20 at 14:26
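
One way to sidestep the ordering concern raised in the last comment is to fill the matrix by cluster name instead of by position. The sketch below assumes a complete three-column table like the one in the question; the table `pairs` and its distances are made up for illustration, and only the column names are taken from the question.

```r
library(ape)

# Hypothetical complete pairwise table for four clusters (made-up distances).
pairs <- data.frame(
  BGC1 = c("BGC31", "BGC31", "BGC31", "BGC34", "BGC34", "BGC45"),
  BGC2 = c("BGC34", "BGC45", "BGC53", "BGC45", "BGC53", "BGC53"),
  distance = c(0.6, 0.9, 0.8, 0.7, 0.2, 0.5),
  stringsAsFactors = FALSE
)

ids <- sort(unique(c(pairs$BGC1, pairs$BGC2)))
n <- length(ids)

# A complete table has choose(n, 2) rows; stop early if rows are missing.
stopifnot(nrow(pairs) == choose(n, 2))

# Fill a symmetric matrix by row/column name, so the row order of `pairs` does not matter.
dst <- matrix(0, nrow = n, ncol = n, dimnames = list(ids, ids))
dst[cbind(pairs$BGC1, pairs$BGC2)] <- pairs$distance
dst[cbind(pairs$BGC2, pairs$BGC1)] <- pairs$distance

# Spot-check one value against the table before building the tree.
stopifnot(dst["BGC53", "BGC34"] == 0.2)

tr2 <- nj(as.dist(dst))
cat(write.tree(tr2), "\n")
```

Because each value is placed by its two names, shuffling the rows of `pairs` gives the same matrix, which is not true when a vector is poured into `lower.tri` directly.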