This is my first time working with R.

I have a table with 3 columns and 12090 rows (156 bacteria). The first two columns are names of bacteria, and the last column is a number indicating relatedness between the two organisms (based on a kind of genome similarity). An example would be (made-up numbers):

bacteria1    bacteria2    0.25846
bacteria1    bacteria3    0.35986
bacteria2    bacteria1    0.57896
bacteria2    bacteria3    0.54596
bacteria3    bacteria1    0.23659
bacteria3    bacteria2    0.36528

I would like to neighbor-join these into a phylogenetic tree of sorts. I see that 'nj' needs a distance matrix to do this. How would I convert this table into a distance matrix or some other usable format? (The numbers are already distances, so no further math should be needed.) I've tried as.dist(), as.matrix() and reshape(), but being new I may have done everything wrong. (reshape may be what I need.)

Or if anyone knows how to make these into a tree through other means that would be grand.

Thanks for any help.

Binnie

2 Answers

Using the library reshape2 (which is distinct from the reshape function in base R, and, I think, a lot

# Load the library (after installing it, of course)
library(reshape2)

# Load up your data - for future reference, it's always helpful to post your data
# with a question.  I used dput(x) to generate this structure below:
x <- structure(list(V1 = structure(c(1L, 1L, 2L, 2L, 3L, 3L), 
     .Label = c("bacteria1", "bacteria2", "bacteria3"),
     class = "factor"), V2 = structure(c(2L, 3L, 1L, 3L, 1L, 2L),
     .Label = c("bacteria1", "bacteria2", "bacteria3"), class = "factor"),
     V3 = c(0.25846, 0.35986, 0.57896, 0.54596, 0.23659, 0.36528)),
     .Names = c("V1", "V2", "V3"), class = "data.frame",
     row.names = c(NA, -6L))

# Recast it - acast returns a matrix with V1 as the rows, V2 as the columns,
# and V3 as the values. Pairs that are absent from the data (here, each
# bacterium against itself) come out as NA.
distmat <- acast(x, V1 ~ V2, value.var = "V3")
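
To get from a square matrix like this to a tree, something along the following lines may work. It is only a sketch under stated assumptions: it rebuilds the matrix in base R so it runs without reshape2, sets the diagonal to 0, and averages the two directions of each pair, because the made-up example gives different values for bacteria1/bacteria2 and bacteria2/bacteria1. Averaging is my assumption, not something the question asks for; as.dist() on its own reads only the lower triangle. The names `taxa`, `symmat` and `d` are illustrative, and the tree step runs only if the ape package is installed.

```r
# Toy data from the question (made-up numbers)
x <- data.frame(
  V1 = c("bacteria1", "bacteria1", "bacteria2", "bacteria2", "bacteria3", "bacteria3"),
  V2 = c("bacteria2", "bacteria3", "bacteria1", "bacteria3", "bacteria1", "bacteria2"),
  V3 = c(0.25846, 0.35986, 0.57896, 0.54596, 0.23659, 0.36528)
)

# Square matrix indexed by name (base R equivalent of the acast call)
taxa <- sort(unique(c(as.character(x$V1), as.character(x$V2))))
distmat <- matrix(NA_real_, nrow = length(taxa), ncol = length(taxa),
                  dimnames = list(taxa, taxa))
distmat[cbind(as.character(x$V1), as.character(x$V2))] <- x$V3

# A bacterium is at distance 0 from itself
diag(distmat) <- 0

# ASSUMPTION: the two directions of a pair are averaged to make the matrix
# symmetric; pick whatever rule suits your similarity measure
symmat <- (distmat + t(distmat)) / 2

# Convert to a "dist" object, which is what nj() expects
d <- as.dist(symmat)

# Build the neighbor-joining tree if ape is available
if (requireNamespace("ape", quietly = TRUE)) {
  tree <- ape::nj(d)
  print(tree)
}
```

If the tree is built, plot(tree) should draw it. With the real data each pair may appear only once, in which case the averaging step would produce NA and the matrix should be mirrored instead.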
Matt Parker

It sounds like you have either the upper or lower triangular portion of the distance matrix, but without the dimensions. (Although are you sure you have 156 rows? If there are 18 species of bacteria, there should be choose(18,2) = 153 entries, not 156.)

Assuming you really have 153 rows in your table, you can fill in the matrix thusly:

m <- matrix(nrow=18, ncol=18)
m[row(m) < col(m)] <- x         # if it's the upper triangular portion

or

m[row(m) > col(m)] <- x         # if it's the lower triangular portion

and then diag(m) <- 0 for the diagonal.
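
One thing worth checking before filling a triangle this way is the order of the rows, because R fills matrices column by column. The sketch below uses a hypothetical 4-taxon table (`taxa`, `pairs` and `dists` are made up for illustration) in which each pair appears once, sorted by first name and then second name. Under that assumption the row order matches the lower triangle, not the upper one. It also confirms that 156 bacteria give 12090 unordered pairs, the row count mentioned in the question.

```r
# Number of unordered pairs among 156 taxa
n_pairs <- choose(156, 2)
print(n_pairs)  # 12090

# Hypothetical toy table: 4 taxa, each pair listed once, sorted by name
taxa  <- c("a", "b", "c", "d")
pairs <- t(combn(taxa, 2))      # rows: a-b, a-c, a-d, b-c, b-d, c-d
dists <- c(1, 2, 3, 4, 5, 6)    # made-up distances, one per row of pairs

# Start from zeros so the diagonal is already 0
m <- matrix(0, nrow = length(taxa), ncol = length(taxa),
            dimnames = list(taxa, taxa))

# Column-major traversal of the LOWER triangle visits
# (b,a), (c,a), (d,a), (c,b), (d,b), (d,c), which matches the row order above
m[row(m) > col(m)] <- dists

# Mirror into the upper triangle to get a full symmetric matrix
m <- m + t(m)

# Cross-check against an order-independent fill by name
m2 <- matrix(0, nrow = length(taxa), ncol = length(taxa),
             dimnames = list(taxa, taxa))
m2[pairs] <- dists
m2[pairs[, 2:1]] <- dists
print(identical(m, m2))
```

If the rows of the real table are not sorted like this, indexing by name as in the cross-check avoids depending on the order at all.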

Hong Ooi
  • Sorry, I meant 156 bacteria; there are 12090 rows. How do I input the data into the matrix? I don't see you calling the data anywhere. – Binnie Oct 12 '12 at 18:54
  • Where I have `x`, substitute the name of your data frame and the column containing your distances. If your data frame is `df` and the distance column is `V3`, it would be `df$V3`, for example. – Hong Ooi Oct 13 '12 at 12:15