I have an R script with the following source code:
# load the genotype table, keep rows with GC_SCORE > 0.15,
# then write the unique IDs from column 2 to a file
genofile <- read.table("D_G.txt", header = TRUE, sep = ",")
genofile <- genofile[genofile$GC_SCORE > 0.15, ]
cat(unique(as.vector(genofile[, 2])), file = "GF_uniqueIDs.txt", sep = "\n")
D_G.txt is a huge file, about 5 GB.
The computation runs on a Microsoft HPC cluster, so when I submit the job it gets split across different physical nodes; in my case each node has 4 GB of RAM.
After a variable amount of time, I get the infamous "cannot allocate vector of size xxx Mb" error. I've tried to use the switch that limits the usable memory:
--max-memory=1GB
but nothing changed.
I've tried Rscript 2.15.0, both 32-bit and 64-bit, with no luck.
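Is processing the file in chunks the way to go? Here is a rough, untested sketch of what I have in mind; the chunk size is arbitrary, and I'm assuming (as in my script above) that GC_SCORE is a named column and that the sample IDs are in column 2:

con <- file("D_G.txt", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # assumes an unquoted header row
ids <- character(0)
repeat {
    # read the next block of rows; read.table errors out at end-of-file,
    # so treat that as an empty chunk and stop
    chunk <- tryCatch(read.table(con, header = FALSE, sep = ",",
                                 col.names = header, nrows = 100000,
                                 stringsAsFactors = FALSE),
                      error = function(e) NULL)
    if (is.null(chunk) || nrow(chunk) == 0) break
    kept <- chunk[chunk$GC_SCORE > 0.15, ]        # same filter as before
    ids <- unique(c(ids, as.vector(kept[, 2])))   # accumulate unique IDs only
}
close(con)
cat(ids, file = "GF_uniqueIDs.txt", sep = "\n")

The idea is that only one chunk plus the growing vector of IDs has to fit in RAM at any time, but I don't know whether this is the right way to do it in R or whether there is a better option.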