7

I am looking for a fast way to extract a large number of rows from an even larger table. The top of my table is as follows:

> head(dbsnp)

      snp      gene distance
rs5   rs5     KRIT1        1
rs6   rs6   CYP51A1        1
rs7   rs7 LOC401387        1
rs8   rs8      CDK6        1
rs9   rs9      CDK6        1
rs10 rs10      CDK6        1

And the dimensions:

> dim(dbsnp)
[1] 11934948        3

I want to select the rows that have the rownames contained in a list:

> head(features)
[1] "rs1367830" "rs5915027" "rs2060113" "rs1594503" "rs1116848" "rs1835693"

> length(features)
[1] 915635

Not surprisingly, the straightforward way of doing this temptable = dbsnp[features,] takes quite a long time.

I've been looking into ways to do this through the sqldf package in R. I thought that that might be faster. Unfortunately, I can't figure out how to select rows with certain rownames in SQL.

Thanks.

Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
Gordon Freeman
  • 125
  • 2
  • 5
  • 3
    check out the `data.table` package. you can do this using keys. Are your "row names" identical to the `snp` column? – Justin Aug 30 '12 at 19:36
  • Yes, the two are the same. I'll look into data.table. – Gordon Freeman Aug 30 '12 at 19:38
  • 1
    Just out of interest, does anyone know whether there's any time difference if you get the row numbers first with some mechanism like `foo<-rownames(dbsnp); bar<-which(foo%in%features,arr.ind=TRUE); temptable<-dbsnp[c(bar),] ` (I know that's not correct R syntax for which, but you get the idea). – Carl Witthoft Aug 30 '12 at 19:57

3 Answers3

10

The data.table solution:

library(data.table)
dbsnp <- structure(list(snp = c("rs5", "rs6", "rs7", "rs8", "rs9", "rs10"
), gene = c("KRIT1", "CYP51A1", "LOC401387", "CDK6", "CDK6", 
"CDK6"), distance = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("snp", 
"gene", "distance"), class = "data.frame", row.names = c("rs5", 
"rs6", "rs7", "rs8", "rs9", "rs10"))

DT <- data.table(dbsnp, key='snp')
features <- c('rs5', 'rs7', 'rs9')
DT[features]

   snp      gene distance
1: rs5     KRIT1        1
2: rs7 LOC401387        1
3: rs9      CDK6        1
Justin
  • 42,475
  • 9
  • 93
  • 111
5

Using sqldf you will need rownames = TRUE then you can query on the rownames using row_names:

library(sqldf)

## input

test<-read.table(header=T,text="      snp      gene distance
rs5   rs5     KRIT1        1
rs6   rs6   CYP51A1        1
rs7   rs7 LOC401387        1
rs8   rs8      CDK6        1
rs9   rs9      CDK6        1
rs10 rs10      CDK6        1
")
features<-c("rs5","rs7","rs10")

## calculate

inVar <- toString(shQuote(features, type = "csh")) # 'rs5','rs7','rs10'

fn$sqldf("SELECT * FROM test t
          WHERE t.row_names IN ($inVar)"
           , row.names = TRUE)

## result
#      snp      gene distance
#rs5   rs5     KRIT1        1
#rs7   rs7 LOC401387        1
#rs10 rs10      CDK6        1

UPDATE: Alternately if fet is a data frame whose features column contains the required items to find:

fet <- data.frame(features)
sqldf("SELECT t.* FROM test t
          WHERE t.row_names IN (SELECT features FROM fet)"
           , row.names = TRUE)

Also if the data were sufficiently large we could speed it up using indexes. See the sqldf home page for this and other details.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
shhhhimhuntingrabbits
  • 7,397
  • 2
  • 23
  • 23
4

The way most people would initially try it would be:

dbsnp[ rownames(dbsnp) %in% features, ]  # which is probably slower than your code

Because you say this is taking a long time, I suspect you have exceeded your RAM capacity and have started using virtual memory. You ought to shutdown your system and then restart with just R as a running application and see if you can avoid "going virtual".

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I tried selecting the rows as you suggested it would 'usually' be done. It's much faster (I didn't realize you can do %in% from lists into lists). – Gordon Freeman Aug 30 '12 at 19:58
  • From what you displayed, I think both of those objects were "atomic vectors" rather than R "lists". I admit to being a bit surprised that it was faster. – IRTFM Aug 30 '12 at 20:01
  • Cool - I'll keep that distinction in mind next time. I actually just learned the difference between the two from your post here: http://stackoverflow.com/questions/8594814/what-are-the-differences-between-r-vector-and-r-list-data-types – Gordon Freeman Aug 30 '12 at 20:16