1

I would like to produce phylogenetic trees from genetic data. I have found a few tree-drawing packages in R and python that look great, e.g. ggtree in R. But these require data inputs that are already in a tree format e.g. Newick.

I think most people start with vcf files and produce FASTA files, but my starting point is a table of genotypes - I work with a haploid organism so each position is either 0 (ref) or 1 (non-ref). From this I compute pairwise genetic distance using dist() in R. Example data for 5 samples, A-E, with pairwise distance over ten variant positions:

# Generate dataframe with example genotypes
Variant <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
A <- c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0)
B <- c(1, 1, 0, 0, 1, 1, 0, 0, 1, 1)
C <- c(0, 0, 1, 1, 0, 0, 1, 1, 0, 1)
D <- c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)
E <- c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1)
df = data.frame(Variant, A, B, C, D, E)
df
#  Remove first column with variant names
df$Variant <- NULL
# Transpose the columns and rows
df_t = t(df)
# Compute pairwise distance (Euclidean)
pdist = dist(df_t, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
pdist

I would like to produce a hierarchical tree output file from pdist e.g. in Newick format, so that I can plug that into packages like ggtree to draw nice trees e.g. circular phylograms, coloured with co-variates etc. I've tried searching but not sure where to start.

EDIT/ UPDATE This website was helpful http://www.phytools.org/Cordoba2017/ex/2/Intro-to-phylogenies.html I used packages: ape, phangorn, phytools, geiger

This code seems to work -

# Produce dendrogram
hclust = hclust(pdist)
# Check dendrogram looks sensible
plot(hclust)
class(hclust) # check that class is hclust
# Save to Newick file
my_tree <- as.phylo(hclust) 
write.tree(phy=my_tree, file="ExampleTree.newick") # Writes a Newick file
# Produce tree
plot(unroot(my_tree),type="unrooted",cex=1.5,
     use.edge.length=TRUE,lab4ut="axial",
     edge.width=2,
     no.margin=TRUE)

Output tree: Unrooted distance tree from example data

Will Hamilton
  • 357
  • 2
  • 17

2 Answers2

2

This is a non-trivial task. To build a tree (as in a bifurcating one) from a distance matrix, you will need to use phylogenetic algorithms and probably better not do it from a distance matrix (note that there might be drawbacks from using Euclidean distance for a binary matrix as well).

However, that said, the task can still be done using the phangorn package. For example, you can create a spectra of splits from the distance matrix (i.e. the likely splits present in the matrix (more details here - paywalled).

require(phangorn)
# Generate dataframe with example genotypes
Variant <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
A <- c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0)
B <- c(1, 1, 0, 0, 1, 1, 0, 0, 1, 1)
C <- c(0, 0, 1, 1, 0, 0, 1, 1, 0, 1)
D <- c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)
E <- c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1)
df = data.frame(Variant, A, B, C, D, E)
df
#  Remove first column with variant names
df$Variant <- NULL
# Transpose the columns and rows
df_t = t(df)
# Compute pairwise distance (Euclidean)
pdist = dist(df_t, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)

# calculate the Hadamard distance spectrum
distances <- distanceHadamard(as.matrix(pdist))
# representing the distances
lento(distances)
# Plotting the distances as a tree (a network actually)
plot(as.networx(distances), "2D")

Note that in the same package neighborNet is also available but the manual highlights that this function is experimental. I suggest contacting the package author for more information.

You can then transform your network in a "phylo" that can be used by ape and probably by ggtree by coercing it:

# Converting into a phylo object
phylo <- as.phylo(distances)

But again, note that this resulting tree is probably incorrect in a phylogenetic sense (i.e. assuming descent with modification) and I strongly suggest simply estimating the tree using a model based approach (e.g. with MrBayes or BEAST2).

Thomas Guillerme
  • 1,747
  • 4
  • 16
  • 23
  • Thanks for your reply. I have amended my question with the solution I found, which seems to work. I will do some reading to understand what the approaches you suggested (MrBayes/ BEAST2) involve! – Will Hamilton Oct 08 '18 at 07:30
0

Binary data can be efficiently used to construct a phylogenetic tree with MrBayes, as @thomas-guillerme mentioned. The input file should include a binary data block and mrbayes commands.

#nexus
begin data;
dimensions ntax = 5 nchar = 10;
format datatype = restriction;
matrix
A 0011001100
B 1100110011
C 0011001101
D 1011001101
E 1000110011;
end;

begin mrbayes;
lset coding = variable;
mcmc ngen = 1000000 samplefreq = 1000;
sump burnin = 200;
sumt burnin = 200;
end;

The length of the mcmc run will need to be adjusted with respect to the chain convergence. As a start, the code should give a good idea on the relationships the data can infer.

nya
  • 2,138
  • 15
  • 29