0

I am working to have a phylogenetic tree based on pairwise-data of genes.Below is my subset of the data(test.txt).The tree does not has to be constructed on the basis of any DNA sequences,but just treating it as words.

ID  gene1   gene2

1   ADRA1D  ADK
2   ADRA1B  ADK
3   ADRA1A  ADK
4   ADRB1   ASIC1
5   ADRB1   ADK
6   ADRB2   ASIC1
7   ADRB2   ADK
8   AGTR1   ACHE
9   AGTR1   ADK
10  ALOX5   ADRB1
11  ALOX5   ADRB2
12  ALPPL2  ADRB1
13  ALPPL2  ADRB2
14  AMY2A   AGTR1
15  AR  ADORA1
16  AR  ADRA1D
17  AR  ADRA1B
18  AR  ADRA1A
19  AR  ADRA2A
20  AR  ADRA2B

Below is my code in R

library(ape)
tab=read.csv("test.txt",sep="\t",header=TRUE)
d=dist(tab,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))

My figure is attached hereenter image description here

I have a question on how they are clustered.Since the pairs

 17 AR  ADRA1B
 18 AR  ADRA1A

and

 2  ADRA1B  ADK
 3  ADRA1A  ADK

should be clustered closely because they have one common gene.so 17 and 2 should be together,and 18 and 3.

Should I use any other method,if I am wrong in using this method(Euclidean distance)?

Should I convert my data to a matrix of rows and columns ,where gene1 is x-axis ,and gene2 is y-axis,each cell being filled by 1 or 0?(Basically if they are paired would mean 1, and if not then 0)

Updated Code :

   table=table(tab$gene1, tab$gene2)
   d <- dist(table,method="euclidean")
   fit <- hclust(d, method="ward")
   plot(as.phylo(fit))

However, in this I get only the genes from gene1 and not gene2 column.The below figure is exactly what I want but should have genes from gene2 column as well

enter image description here

Rgeek
  • 419
  • 1
  • 9
  • 23
  • 2
    I'm a bit puzzled. How can an euclidean distance be computed on strings? Are the gene1 and gene2 columns in your example really strings or factors? In case they are factor and `dist`calculates the euclidean distance on the factor levels, nothing reasonable can be expected, I think. – Georg Schnabel Feb 17 '14 at 21:04
  • I'm not sure, but it looks like you're trying to do clustering based on the *union* of the genes present in a given taxon, whereas I think that `hclust` will do clustering based on the identity of *each* gene -- i.e. if taxon 1 has `gene1=A`, `gene2=B` and taxon 2 has `gene2=B`, `gene2=A`, they won't match at all ... – Ben Bolker Feb 17 '14 at 21:11
  • @Georg Those are strings ,and I am looking for methods to get some relation out of this table to have some clustering and build a tree.i agree Euclidean can not be used,I just wanted to give an example,as to what I want. – Rgeek Feb 17 '14 at 21:33

1 Answers1

2

There is some room for interpretation in the example of the question. My answer is only valid if there are really only two genes present in each individual and each row describes an individual. If, however, each row means that gene1 occurs with gene2 with certainty no useful clustering can be performed, in my opinion. In that case I would expect an additional column stating the probability for their common occurrence and something like an principal component analysis (PCA) may be preferred, but I am far away from being an expert on (hierarchial) clustering.

Before you can use the dist function, you have to bring your data into an appropriate format:

# convert test data into suitable format
gene.names <- sort(unique(c(tab[,"gene1"],tab[,"gene2"])))
gene.matrix <- cbind(tab[,"ID"],matrix(0L,nrow=nrow(tab),ncol=length(gene.names)))
colnames(gene.matrix) <- c("ID",gene.names)
lapply(seq_len(nrow(tab)),function(x) gene.matrix[x,match(tab[x,c("gene1","gene2")],colnames(gene.matrix))]<<-1)

The obtained gene.matrix has the shape:

     ID ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
[1,]  1    0   1      0      0      0      1      0
[2,]  2    0   1      0      0      1      0      0
[3,]  3    0   1      0      1      0      0      0
[4,]  4    0   0      0      0      0      0      0
...

So each row represents an observation (=individual) where the first column identifies the individual and each of the subsequent columns contains 1 if the gene is present and 0 if it is missing. On this matrix the dist function can be reasonably applied (ID column removed):

d <- dist(gene.matrix[,-1],method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))

Maybe, it is a good idea to read up the differences between the distance measures euclidean, manhattan etc. For instance, the euclidian distance between the individuals with ID=1 and ID=2 is:

euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)

whereas the manhattan distance is

manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...
Georg Schnabel
  • 631
  • 4
  • 8
  • I used this command: table(tab$gene1, tab$gene2) to get matrix.I want to plot phylogeny of genes,and not as pairs.I want to see how they get clustered individually and not as pairs. – Rgeek Feb 17 '14 at 23:19
  • The number of group members is determined by the clustering algorithm. I think there is some information missing in your (example) data to properly perform hierarchial clustering. Every row should contain all the strings for the genes which are present in the individual. – Georg Schnabel Feb 18 '14 at 10:14
  • I was able to obtain the phylogenetic plot I was looking for.As I said I wanted to cluster individual genes(not pairs)so had to modify the code a little bit.However it might be interesting to do that as well.Is there any way to get tabular information on which ones get clustered together(cluster number)?So that I can have a table having the cluster number and its assigned items,after we have calculated " fit <- hclust(d, method="ward")" – Rgeek Feb 20 '14 at 04:25