
I have a large text file (>10 million rows, > 1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:

library(bigmemory)
library(pryr)

con <- file('x.csv', open = "r")                        # open a read-only connection
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')   # pre-allocate the result

for (i in 1:5){
   print(c(address(x), refs(x)))              # track the object's address each iteration
   y <- readLines(con, n = 1, warn = FALSE)   # read a single line
   x[i] <- 2L*as.integer(y)
}

close(con)

where x.csv contains

4
18
2
14
16

Following the advice here (http://adv-r.had.co.nz/memory.html) I have printed the memory address of my big.matrix object, and it appears to change with each loop iteration:

[1] "0x101e854d8" "2"          
[1] "0x101d8f750" "2"          
[1] "0x102380d80" "2"          
[1] "0x105a8ff20" "2"          
[1] "0x105ae0d88" "2"   
  1. Can big.matrix objects be modified in place?

  2. is there a better way to load, process and then save these data? The current method is slow!

slabofguinness
  • I would check out the `data.table` package. It is designed for "big" data (or at least larger data). – Richard Erickson Jul 14 '15 at 13:04
  • General comment: R operates completely in memory, so if your final data set is more than the memory allocated for your R console, you may have a problem. – Tim Biegeleisen Jul 14 '15 at 13:08
  • @Tim, I will try filebacked.big.matrix() to avoid exceeding my RAM allocation. – slabofguinness Jul 14 '15 at 13:12
  • @Richard I'm not aware of an option in read.table to process input one line at a time. – slabofguinness Jul 14 '15 at 13:13
  • @SeamusO'Bairead, with `data.table` you might be able to avoid reading in your data one line at a time. Also, using a `data.table` __might__ handle memory better than `big.matrix`. Last, if I recall, the `fread` function wasn't working at the time, so I had to install `data.table` from GitHub. If you try `fread` and it doesn't work, that would be my first troubleshooting step. – Richard Erickson Jul 14 '15 at 14:15
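
For what it's worth, a minimal sketch of the `data.table` route suggested in the last comment, assuming a current `data.table` (for `fread`/`fwrite`) and that the file fits in RAM once loaded; the column name `v` and the output file name are my own illustrative choices:

    library(data.table)

    # Read the single-column file in one pass; fread memory-maps the file
    # and is typically much faster than readLines/read.table
    dt <- fread("x.csv", header = FALSE, col.names = "v")

    # Do the per-row work as a single vectorised column operation
    dt[, doubled := 2L * v]

    fwrite(dt, "x_doubled.csv")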

1 Answer

  1. Is there a better way to load, process and then save these data? The current method is slow!

The slowest part of your method appears to be the call that reads each line individually. We can 'chunk' the data, i.e. read in several lines at a time, so that we stay under the memory limit while possibly speeding things up.

Here's the plan:

  1. Figure out how many lines the file has
  2. Read in a chunk of those lines
  3. Perform some operation on that chunk
  4. Write that chunk out to a new file to save for later

    library(readr)

    # Make an example file: 100,000 rows by 10 columns of random normals
    x <- data.frame(matrix(rnorm(1000000), 100000, 10))

    write_csv(x, "./test_set2.csv")

    # Create a function to read a variable from a file, chunk by chunk, and double it
    calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                           read.size = 500000, variable = "X1") {
      # Set up variables
      num.lines <- 0
      lines.per <- NULL
      i <- 0L

      # Gather column names and the position of the objective column
      connection.names <- file(calc.file, open = "r")
      data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
      close(connection.names)
      col.name <- which(colnames(data.names) == variable)

      # Find the length of the file in lines, counted chunk by chunk
      connection.len <- file(calc.file, open = "r")
      while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
        i <- i + 1L                     # increment before assigning so the first chunk isn't dropped
        lines.per[i] <- linesread
        num.lines <- num.lines + linesread
      }
      close(connection.len)

      # Make a connection for the doubling pass
      # Loop through the file and double the chosen variable
      connection.double <- file(calc.file, open = "r")
      for (j in 1:length(lines.per)) {

        # The first chunk still contains the header line: skip it and read
        # one row fewer so later chunks stay aligned with their line counts
        if (j == 1) {
          data <- read.table(connection.double, sep = ",", header = FALSE,
                             skip = 1, nrows = lines.per[j] - 1, comment.char = "")
        } else {
          data <- read.table(connection.double, sep = ",", header = FALSE,
                             nrows = lines.per[j], comment.char = "")
        }

        # Grab the column we need and double it
        double <- data[, col.name] * 2

        # Overwrite on the first chunk, append on every later chunk
        if (j != 1) {
          write_csv(data.frame(double), outputFile, append = TRUE)
        } else {
          write_csv(data.frame(double), outputFile)
        }

        message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
      }
      close(connection.double)
    }

    calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")

So we get back a .csv file with the manipulated data. You can change `double <- data[, col.name] * 2` to whatever operation you need to apply to each chunk.
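
If you would rather not manage the connections yourself, readr also ships a chunked reader that does this bookkeeping for you. This is my own sketch, not part of the answer above; it assumes a reasonably recent readr and the test file written earlier (columns `X1`..`X10`):

    library(readr)

    # Callback run on each chunk: double X1 and write the result out,
    # overwriting on the first chunk (pos == 1) and appending afterwards
    double_chunk <- function(chunk, pos) {
      write_csv(data.frame(double = chunk$X1 * 2), "./outPut_File.csv",
                append = pos != 1)
    }

    read_csv_chunked("./test_set2.csv",
                     SideEffectChunkCallback$new(double_chunk),
                     chunk_size = 50000)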

Steve Bronder
  • @Steve_Corring thanks I'll do some benchmarks and get back to you. Also write_csv should be write.csv. – slabofguinness Jul 14 '15 at 18:20
  • @Steve_Corring Ah ok I see! – slabofguinness Jul 14 '15 at 18:36
  • @Steve_Corrin, you are correct that the file reading by line is the slow part, but your first statement about in-place modification is incorrect. A `big.matrix` refers to a block of memory that has been mapped. When you change an element you change it in that memory block, so any other objects pointing to that memory reflect the change. The memory address that is returned during each iteration is a copy of the pointer, not the matrix itself. – cdeterman Jul 15 '15 at 14:33
  • @cdeterman My mistake. I assumed big.matrix used some sort of out-of-memory approach to handle large data sets. Just looked over the docs and removed my answer to the first question. Thanks for the check – Steve Bronder Jul 15 '15 at 14:40
  • @Steve_Corrin, if using the `filebacked.big.matrix` option then it is out-of-memory. However, it is designed to still modify that matrix in-place (given that the `filebacked` object refers to a separate temp file). The idea is to create as little overhead as possible, prevent copies (unless explicit) and allow for objects larger than available RAM. – cdeterman Jul 15 '15 at 14:45
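
To make the in-place behaviour described in these last comments concrete, here is a minimal sketch of my own (not from the thread); the backing-file names are illustrative only:

    library(bigmemory)

    # Two variables bound to the same big.matrix see each other's writes,
    # because the underlying memory block is modified in place; only the R
    # object wrapping the pointer gets copied, which is why address() in the
    # question appears to change even though no data is copied.
    x <- big.matrix(nrow = 5, ncol = 1, type = "integer", init = 0L)
    y <- x             # copies the pointer, not the data
    x[1, 1] <- 42L
    y[1, 1]            # 42 -- the shared block was updated

    # The file-backed variant is out-of-memory but still modified in place
    fb <- filebacked.big.matrix(nrow = 5, ncol = 1, type = "integer",
                                backingfile = "fb.bin",
                                descriptorfile = "fb.desc")
    fb[1, 1] <- 2L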