I have a large data frame that I want to filter and convert into a binary data frame based on several conditions.

This is the original data frame:

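a1 <- data.frame(
  ID = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  C = c("Q", "R", "S", "S", "R", "Q"),
  D = c(8, 3, 3, 4, 5, 4),
  E = sample(c("silent", "non-silent"), 6, replace = TRUE),  # random, so E differs between runs
  stringsAsFactors = TRUE  # needed in R >= 4.0 so that levels() works on ID and gene
)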

For example:

    ID gene C D          E
1 ID_1    A Q 8 non-silent
2 ID_1    D R 3     silent
3 ID_1    X S 3     silent
4 ID_2    D S 4 non-silent
5 ID_2    D R 5     silent
6 ID_2    A Q 4 non-silent

I have now created an empty data frame with the IDs as columns and the genes as rows:

# factor() makes this work whether the columns are factors or character vectors
genes <- levels(factor(a1$gene))
ids <- levels(factor(a1$ID))
dt <- as.data.frame(matrix(NA, length(genes), length(ids) + 1))
colnames(dt) <- c("gene", ids)
dt$gene <- genes

  gene ID_1 ID_2
1    A   NA   NA
2    D   NA   NA
3    X   NA   NA

Now I want to put a 1 where a gene is present for an ID and a 0 where it is not. Later I would also like to add further conditions, for example only putting a 1 for rows that are non-silent in the E column. Is there a way to do this in base R, or with a package such as data.table or plyr (ddply)?

paul_dg

2 Answers


You can use dcast from the reshape2 package. Without an aggregation function it counts the rows for each gene/ID pair:

library(reshape2)
dcast(a1, gene ~ ID)
#   gene ID_1 ID_2
# 1    A    1    1
# 2    D    1    2
# 3    X    1    0

or, to get a 0/1 indicator instead of counts:

dcast(a1, gene ~ ID, fun.aggregate = function(x) (length(x) > 0L) * 1L)
#   gene ID_1 ID_2
# 1    A    1    1
# 2    D    1    1
# 3    X    1    0

dcast is also available for data.table objects.

lukeA
  • This is a nice solution, and it works perfectly for the example. But I get some strange results for the original data. I get the message: Using freq as value column: use value.var to override. And not all IDs are present in the binary table. – paul_dg Jun 25 '14 at 11:51
  • I don't know about your real data, but to get rid of the warning (?) message just specify `value.var = "ID"` explicitly. – lukeA Jun 25 '14 at 11:59
  • Solved it, thank you! But what if I want to take other columns into account, for example to keep only the non-silent rows in the E column? – paul_dg Jun 25 '14 at 12:03
  • You could change the formula from `gene ~ ID` to `gene + E ~ ID`. – lukeA Jun 25 '14 at 12:23
  • If OP's data is a `data.table`, use `dcast.data.table` instead of `dcast` to get the best of both worlds – eddi Jun 25 '14 at 16:11
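
The follow-up in the comments asks how to count a gene only when it is non-silent. One possible sketch, in base R rather than the reshape2 approach from the answer, is to filter first and then cross-tabulate with table(). The E values below are hard-coded to match the printed example in the question (an assumption, since the original uses sample()), and the names keep, counts and binary are only illustrative.

```r
# Example data; E is fixed here (assumption) to match the printed table in the question
a1 <- data.frame(
  ID   = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  E    = c("non-silent", "silent", "silent", "non-silent", "silent", "non-silent"),
  stringsAsFactors = TRUE
)

# Logical filter for the extra condition
keep <- a1$E == "non-silent"

# Subsetting a factor keeps all of its levels, so genes and IDs that
# have no non-silent rows still appear in the table (with a count of 0)
counts <- table(gene = a1$gene[keep], ID = a1$ID[keep])

# Turn counts into a 0/1 indicator and then into a wide data frame
binary <- as.data.frame.matrix((counts > 0) * 1L)
binary
```

Because gene and ID are factors, X still gets a row of zeros even though it has no non-silent entries. Further conditions can be added by extending the keep expression with &.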

To see if a gene is present for each ID:

dt$ID_1 <- dt$gene %in% a1[a1$ID == "ID_1", ]$gene
dt$ID_2 <- dt$gene %in% a1[a1$ID == "ID_2", ]$gene

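These columns are logical (TRUE/FALSE); wrap the right-hand side in as.integer() if you need 1/0. dt$ID_1 & dt$ID_2 will then give you the genes that are present in both IDs.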

If you have many IDs and want to iterate over them, you can use e.g. lapply. To apply the same check to other columns, replace the hard-coded ID string and column name with variables and wrap the expression in a function.
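
A minimal sketch of that iteration, assuming the a1 layout from the question. It uses sapply rather than lapply so the result comes back as a matrix, and present_for is a hypothetical helper name, not part of any package.

```r
# Example data with the two columns the check needs
a1 <- data.frame(
  ID   = c(rep("ID_1", 3), rep("ID_2", 3)),
  gene = c("A", "D", "X", "D", "D", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one ID, is each gene present in the data?
present_for <- function(id, genes, data) {
  genes %in% data$gene[data$ID == id]
}

genes <- sort(unique(a1$gene))
ids   <- sort(unique(a1$ID))

# One logical column per ID; sapply names the columns after the IDs
flags <- sapply(ids, present_for, genes = genes, data = a1)

# Multiply by 1L to convert TRUE/FALSE to 1/0
dt <- data.frame(gene = genes, flags * 1L, stringsAsFactors = FALSE)
dt
```

To add a condition such as non-silent only, pass a pre-filtered data frame as data, for example a1[a1$E == "non-silent", ] when the E column is present.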

konvas