I would like to produce phylogenetic trees from genetic data. I have found a few tree-drawing packages in R and python that look great, e.g. ggtree in R. But these require data inputs that are already in a tree format e.g. Newick.
I think most people start with vcf files and produce FASTA files, but my starting point is a table of genotypes - I work with a haploid organism so each position is either 0 (ref) or 1 (non-ref). From this I compute pairwise genetic distance using dist() in R. Example data for 5 samples, A-E, with pairwise distance over ten variant positions:
# Generate dataframe with example genotypes
Variant <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
A <- c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0)
B <- c(1, 1, 0, 0, 1, 1, 0, 0, 1, 1)
C <- c(0, 0, 1, 1, 0, 0, 1, 1, 0, 1)
D <- c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)
E <- c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1)
df = data.frame(Variant, A, B, C, D, E)
df
# Remove first column with variant names
df$Variant <- NULL
# Transpose the columns and rows
df_t = t(df)
# Compute pairwise distance (Euclidean)
pdist = dist(df_t, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
pdist
I would like to produce a hierarchical tree output file from pdist e.g. in Newick format, so that I can plug that into packages like ggtree to draw nice trees e.g. circular phylograms, coloured with co-variates etc. I've tried searching but not sure where to start.
EDIT/ UPDATE This website was helpful http://www.phytools.org/Cordoba2017/ex/2/Intro-to-phylogenies.html I used packages: ape, phangorn, phytools, geiger
This code seems to work -
# Produce dendrogram
hclust = hclust(pdist)
# Check dendrogram looks sensible
plot(hclust)
class(hclust) # check that class is hclust
# Save to Newick file
my_tree <- as.phylo(hclust)
write.tree(phy=my_tree, file="ExampleTree.newick") # Writes a Newick file
# Produce tree
plot(unroot(my_tree),type="unrooted",cex=1.5,
use.edge.length=TRUE,lab4ut="axial",
edge.width=2,
no.margin=TRUE)