
I have rewritten my program several times to avoid hitting memory limits, yet it still ends up consuming the full virtual memory (VIRT), which does not make sense to me. I do not keep any objects around; I write each result to disk as soon as the calculation is done.

The code (simplified) looks like this:


 lapply(foNames, # folder names such as c("~/datasets/xyz", "~/datasets/xyy")
        function(foName){
     Filepath <- paste(foName, "somefile.rds", sep = "")
     CleanDataObject <- readRDS(Filepath) # reads the data

     cl <- makeCluster(CONF$CORES2USE) # spins up a cluster (it does not matter whether I use the cluster or not; the problem is independent of that, imho)

     mclapply(c(1:noOfDataSets2Generate), function(x, CleanDataObject){
                                              bootstrapper(CleanDataObject)
                                          }, CleanDataObject)
     stopCluster(cl)
 })

The bootstrap function simply samples the data and saves the sampled data to disk:

bootstrapper <- function(CleanDataObject){

   newCPADataObject <- sample(CleanDataObject)
   newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")

   saveRDS(newCPADataObject, paste(newCPADataObject$sha1, ".rds", sep = ""))

   return(newCPADataObject)
}

I do not understand how this can accumulate to over 60 GB of RAM. The code is highly simplified, but imho there is nothing else that could be problematic. I can paste more code details if needed.

How does R manage to successively eat up my memory, even though I have already rewritten the software to store the generated objects on disk?

kn1g

1 Answer


I have had this problem with loops in the past. It is more complicated to address inside functions and the apply family.

But what I have done is use two things in combination to fix the problem.

Within each function that generates temporary objects, use rm(object_name) to remove the temporary object and then run gc(), which forces a garbage collection, before exiting the function. This will slow the process somewhat but reduce memory pressure. This way each iteration of apply will purge its temporaries before moving on to the next step. With nested functions you may have to go back to your first function to accomplish this well; it takes experimentation to figure out where the system is getting backed up.
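A minimal sketch of that pattern applied to the code from the question (assuming, as in the question, that foNames, CONF and noOfDataSets2Generate already exist, and that the return value of bootstrapper() is not needed because each sampled object is already written to disk):

library(parallel) # for mclapply

bootstrapper <- function(CleanDataObject){
    newCPADataObject <- sample(CleanDataObject)
    newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")
    saveRDS(newCPADataObject, paste(newCPADataObject$sha1, ".rds", sep = ""))

    # the sampled object is already on disk, so drop it and force a
    # garbage collection before exiting rather than returning it
    rm(newCPADataObject)
    invisible(gc())
    NULL
}

lapply(foNames, function(foName){
    CleanDataObject <- readRDS(paste(foName, "somefile.rds", sep = ""))

    mclapply(seq_len(noOfDataSets2Generate),
             function(x, CleanDataObject) bootstrapper(CleanDataObject),
             CleanDataObject)

    # purge the loaded data set and collect before moving on to the next folder
    rm(CleanDataObject)
    invisible(gc())
    NULL
})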

I find this to be especially necessary if you use ANY methods called from packages built over rJava; these are extremely wasteful of resources, R has no way of running garbage collection on the Java heap, and most authors of Java packages do not seem to account for the need to collect in their methods.

sconfluentus
  • Thanks, right now I am trying to profile the memory usage by printing ```sapply(ls(), function(x){ print("Sec Loop"); print(object.size(get(x))) })``` at the end of each function to find the objects that grow in memory (a runnable version of this check is sketched after these comments). I will then use your approach to clean up, most likely. I will keep you posted if I solve it. – kn1g Mar 29 '20 at 15:24
  • Are you confusing *variables* and *files*? – Konrad Rudolph Mar 29 '20 at 15:42
  • Sorry, what do you mean? I see the problem as follows: 1. I load an object from disk into memory. 2. I process the object and create a new object by sampling. 3. I directly save this object to disk. 4. I create a new object by sampling, and so on... After x sampling steps, it loads a new object from disk, samples this object x times, and writes the sampled objects to disk. Hence, in my opinion, somewhere some object grows in memory, and I do not know which one. – kn1g Mar 29 '20 at 16:01
  • After each read, and before the next iteration `rm()` the last file and `gc()` – sconfluentus Mar 29 '20 at 16:06
  • Ok, thanks, I will try your approach now, because my memory profiling did not yield any results. All the objects I tracked via ls() were fine; none of them was getting bigger. – kn1g Mar 29 '20 at 16:08
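A runnable sketch of the per-object size check described in the comments above (report_sizes is a hypothetical helper name; it would be called at the end of each function of interest):

# hypothetical helper: print the size of every object local to the caller
report_sizes <- function(label, env = parent.frame()) {
    for (nm in ls(envir = env)) {
        cat(label, nm,
            format(object.size(get(nm, envir = env)), units = "auto"), "\n")
    }
}

# called at the end of a function of interest, e.g. inside bootstrapper():
#   report_sizes("bootstrapper:")   # prints one line per local object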