
I am trying to select, for each gene_id, the row with the minimum value of the start coordinate, href_pos$start. Why do I get the error below, even though I have a memory limit of ~16 GB? Or what am I doing wrong? Here are my data and code:

head(href_pos, 5)

    chr      region    start      end strand nu   gene_id
1  chr1 start_codon 67000042 67000044      +  . NM_032291
2  chr1         CDS 67000042 67000051      +  0 NM_032291
3  chr1        exon 66999825 67000051      +  . NM_032291
4  chr1         CDS 67091530 67091593      +  2 NM_032291
5  chr1        exon 67091530 67091593      +  . NM_032291

library(plyr)
d1 <- ddply(as.data.frame(href_pos), "gene_id", function(href_pos) href_pos[which.min(href_pos$start), ])

Error: cannot allocate vector of size 283 Kb
In addition: Warning messages:
1: In lapply(dfs, function(df) levels(df[[var]])) :
  Reached total allocation of 16383Mb: see help(memory.size)

Nanami
  • R is telling you that your 16 gigs of memory aren't enough to execute the `ddply` function. You don't tell us how big `href_pos` is, but it must be pretty sizable if you're using up 16 gigs of RAM. In general, `plyr` does not perform well with large data. See [here](http://stackoverflow.com/questions/10748253/idiomatic-r-code-for-partitioning-a-vector-by-an-index-and-performing-an-operati/10748470#10748470) for some proof. Try the `data.table` package for speed and efficiency. See [here](http://stackoverflow.com/questions/11564775/exceeding-memory-limit-in-r-even-with-24gb-ram/11564999#11564999) – Chase Oct 27 '12 at 18:45
  • When I run `object.size(href_pos)` I get **15157856 bytes** – Nanami Oct 27 '12 at 18:53
  • Note that the obvious suspect may not be the actual culprit in out-of-memory errors in R (although here it seems like it is). It is easy to fill up RAM with lots of objects, then forget to clean up, start a completely new analysis and run out of RAM, even though the new analysis would on its own not have the slightest memory problem. This kind of error is virtually impossible to reproduce and _extremely_ hard to track down. I personally start _every_ R script with `rm(list=ls())` to avoid this situation (see the sketch after these comments). – Stephan Kolassa Oct 27 '12 at 19:33
  • I do that too, and I also run `gc()`. But my table `href_pos` only occupies ~14.45 MB. I am trying to figure out another alternative now, without using `ddply()`. – Nanami Oct 28 '12 at 09:33
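
A minimal sketch of the clean-up habits mentioned in these comments, using only base-R calls (the two halves belong at different points of a workflow, not one right after the other):

rm(list = ls())         # at the top of a fresh script: drop all leftover objects
gc()                    # force a garbage collection and report memory in use

object.size(href_pos)   # later: check how much RAM a single object actually uses
                        # (the asker reports 15157856 bytes, ~14.45 MB)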

1 Answer


Proof that your syntax is fine:

# Create a minimal, reproducible example
library(plyr)
gene_id <- gl(3, 3, 9, labels = letters[1:3])  # note: labels =, not labels <-
start <- rep(1:3, 3)
href_pos <- data.frame(gene_id = gene_id, start = start)

d1 <- ddply(as.data.frame(href_pos), "gene_id", function(href_pos) href_pos[which.min(href_pos$start), ])
d1
  gene_id start
1       a     1
2       b     1
3       c     1

To do it with data.table as Chase suggests, this should work:

require(data.table)
HREF_POS <- data.table(href_pos)
setkey(HREF_POS, gene_id)
# For each gene_id, keep the (first) row with the smallest start
MINS <- HREF_POS[, .SD[which.min(start)], by = gene_id]
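
On the toy data above, this keeps one row per gene; the result should print something like this (all minima are 1 by construction):

MINS
   gene_id start
1:       a     1
2:       b     1
3:       c     1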
Drew Steen
  • I have tried my syntax on a small matrix and it worked very nicely and fast. I tried the `data.table` version and it works instantly. Thanks (: – Nanami Oct 28 '12 at 09:53