I have a dataframe with loci names in one column and DNA sequences in the other. I'm trying to use as.DNAbin{ape} or similar to create a DNAbin object.

Here is some example data:

x <- structure(c("55548", "43297", "35309", "34468",
  "AATTCAATGCTCGGGAAGCAAGGAAAGCTGGGGACCAACTTCTCTTGGAGACATGAGCTTAGTGCAGTTAGATCGGAAGAGCA",
  "AATTCCTAAAACACCAATCAAGTTGGTGTTGCTAATTTCAACACCAACTTGTTGATCTTCACGTTCACAACCGTCTTCACGTT",
  "AATTCACCACCACCACTAGCATACCATCCACCTCCATCACCACCACCGGTTAAGATCGGAAGAGCACACTCTGAACTCCAGTC",
  "AATTCTATTGGTCATCACAATGGTGGTCCGTGGCTCACGTGCGTTCCTTGTGCAGGTCAACAGGTCAAGTTAAGATCGGAAGA"),
  .Dim = c(4L, 2L))

If I try y <- as.DNAbin(x), R creates a sort of DNAbin object with 4 DNA sequences (the 4 rows of the example) of length 2 (the two columns, I assume). There are no labels, and of course the base composition doesn't work either.

The documentation is not very clear, but after playing with the woodmouse example data from the package, I think what I need to do is create a matrix with each base in its own column and then use as.DNAbin, i.e. in the above example a 4 x 84 matrix (1 column for the locus name and 83 for the sequence?). Any advice on how to do this? Or any better ideas?

Thanks


1 Answer

The first argument of as.DNAbin should be a matrix or a list containing the DNA sequences, or an object of class "alignment". So your idea is right.

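Given that x is the structure from the original post, the code below builds a 4 x 83 character matrix y with one base per column. The locus names go in as row names, not as an extra column: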

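# Split each sequence into single lowercase bases: one row per locus, one column per base
y <- t(sapply(strsplit(x[,2],""), tolower))
# Use the locus names as row names; they become the sequence labels
rownames(y) <- x[,1]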

Then as.DNAbin(y) shows:

4 DNA sequences in binary format stored in a matrix.

All sequences of same length: 83 

Labels: 55548 43297 35309 34468 

Base composition:
    a     c     g     t 
0.289 0.262 0.205 0.244 
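
As noted at the top of this answer, as.DNAbin also accepts a list. Below is a minimal sketch of that route, which may be useful when the sequences differ in length and so cannot be stacked into a matrix. The toy matrix x_toy and its locus names are invented for illustration, and the as.DNAbin call is guarded so the snippet still runs if ape is not installed.

```r
# Toy input in the same layout as x: column 1 = locus names, column 2 = sequences.
# These names and sequences are made up for illustration only.
x_toy <- matrix(c("locusA", "locusB", "AATTC", "GGCCATA"), ncol = 2)

# One character vector of lowercase bases per locus, named by locus.
seqs <- strsplit(tolower(x_toy[, 2]), "")
names(seqs) <- x_toy[, 1]

# Convert to DNAbin only if the ape package is available.
if (requireNamespace("ape", quietly = TRUE)) {
  d <- ape::as.DNAbin(seqs)
  print(d)
}
```

With the x from the question, the same two steps would presumably apply to x[, 2] and x[, 1] in place of the toy columns.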
  • Many thanks, this works. Just a small edit (so small it doesn't let me make it): in the intro paragraph it should say `as.DNAbin`, not `as.DNA`. – A.Mstt Jan 14 '14 at 16:05