
I have a large text file (>10 million rows, > 1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:

library(bigmemory)
library(pryr)

con <- file('x.csv', open = "r")                        # open a read-only connection
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')   # pre-allocate the result

for (i in 1:5){
   print(c(address(x), refs(x)))              # track the object's address each iteration
   y <- readLines(con, n = 1, warn = FALSE)   # read a single line
   x[i] <- 2L*as.integer(y)
}

close(con)

where x.csv contains

4
18
2
14
16

Following the advice here (http://adv-r.had.co.nz/memory.html) I have printed the memory address of my big.matrix object, and it appears to change with each loop iteration:

[1] "0x101e854d8" "2"          
[1] "0x101d8f750" "2"          
[1] "0x102380d80" "2"          
[1] "0x105a8ff20" "2"          
[1] "0x105ae0d88" "2"   
  1. Can big.matrix objects be modified in place?

  2. is there a better way to load, process and then save these data? The current method is slow!

slabofguinness
  • I would check out the `data.table` package. It is designed for "big" data (or at least larger data). – Richard Erickson Jul 14 '15 at 13:04
  • General comment: R operates completely in memory, so if your final data set is more than the memory allocated for your R console, you may have a problem. – Tim Biegeleisen Jul 14 '15 at 13:08
  • @Tim, I will try filebacked.big.matrix() to avoid exceeding my RAM allocation. – slabofguinness Jul 14 '15 at 13:12
  • @Richard I'm not aware of an option in read.table to process input one line at a time. – slabofguinness Jul 14 '15 at 13:13
  • @SeamusO'Bairead, with `data.table` you might be able to avoid reading in your data one line at a time. Also, using a `data.table` __might__ handle memory better than `big.matrix`. Last, if I recall, the `fread` function wasn't working at the time, so I had to install `data.table` from GitHub. If you try `fread` and it doesn't work, that would be my first troubleshooting step. – Richard Erickson Jul 14 '15 at 14:15
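
For what it's worth, a minimal sketch of the `data.table` route suggested in the last comment, assuming a current `data.table` (for `fread`/`fwrite`) and that the file fits in RAM once loaded; the column name `v` and the output file name are my own illustrative choices:

    library(data.table)

    # Read the single-column file in one pass; fread memory-maps the file
    # and is typically much faster than readLines/read.table
    dt <- fread("x.csv", header = FALSE, col.names = "v")

    # Do the per-row work as a single vectorised column operation
    dt[, doubled := 2L * v]

    fwrite(dt, "x_doubled.csv")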

1 Answer

  1. Is there a better way to load, process and then save these data? The current method is slow!

The slowest part of your method appears to be the call that reads each line individually. We can 'chunk' the data, i.e. read in several lines at a time, so that we stay under the memory limit while possibly speeding things up.

Here's the plan:

  1. Figure out how many lines the file has
  2. Read in a chunk of those lines
  3. Perform some operation on that chunk
  4. Write that chunk out to a new file to save for later

    library(readr)

    # Make an example file: 100,000 rows by 10 columns of random normals
    x <- data.frame(matrix(rnorm(1000000), 100000, 10))

    write_csv(x, "./test_set2.csv")

    # Create a function to read a variable from a file, chunk by chunk, and double it
    calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                           read.size = 500000, variable = "X1") {
      # Set up variables
      num.lines <- 0
      lines.per <- NULL
      i <- 0L

      # Gather column names and the position of the objective column
      connection.names <- file(calc.file, open = "r")
      data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
      close(connection.names)
      col.name <- which(colnames(data.names) == variable)

      # Find the length of the file in lines, counted chunk by chunk
      connection.len <- file(calc.file, open = "r")
      while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
        i <- i + 1L                     # increment before assigning so the first chunk isn't dropped
        lines.per[i] <- linesread
        num.lines <- num.lines + linesread
      }
      close(connection.len)

      # Make a connection for the doubling pass
      # Loop through the file and double the chosen variable
      connection.double <- file(calc.file, open = "r")
      for (j in 1:length(lines.per)) {

        # The first chunk still contains the header line: skip it and read
        # one row fewer so later chunks stay aligned with their line counts
        if (j == 1) {
          data <- read.table(connection.double, sep = ",", header = FALSE,
                             skip = 1, nrows = lines.per[j] - 1, comment.char = "")
        } else {
          data <- read.table(connection.double, sep = ",", header = FALSE,
                             nrows = lines.per[j], comment.char = "")
        }

        # Grab the column we need and double it
        double <- data[, col.name] * 2

        # Overwrite on the first chunk, append on every later chunk
        if (j != 1) {
          write_csv(data.frame(double), outputFile, append = TRUE)
        } else {
          write_csv(data.frame(double), outputFile)
        }

        message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
      }
      close(connection.double)
    }

    calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")

So we get back a .csv file with the manipulated data. You can change `double <- data[, col.name] * 2` to whatever operation you need to apply to each chunk.
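
If you would rather not manage the connections yourself, readr also ships a chunked reader that does this bookkeeping for you. This is my own sketch, not part of the answer above; it assumes a reasonably recent readr and the test file written earlier (columns `X1`..`X10`):

    library(readr)

    # Callback run on each chunk: double X1 and write the result out,
    # overwriting on the first chunk (pos == 1) and appending afterwards
    double_chunk <- function(chunk, pos) {
      write_csv(data.frame(double = chunk$X1 * 2), "./outPut_File.csv",
                append = pos != 1)
    }

    read_csv_chunked("./test_set2.csv",
                     SideEffectChunkCallback$new(double_chunk),
                     chunk_size = 50000)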

Steve Bronder
  • @Steve_Corring thanks I'll do some benchmarks and get back to you. Also write_csv should be write.csv. – slabofguinness Jul 14 '15 at 18:20
  • @Steve_Corring Ah ok I see! – slabofguinness Jul 14 '15 at 18:36
  • @Steve_Corrin, you are correct that the file reading by line is the slow part, but your first statement about in-place modification is incorrect. A `big.matrix` refers to a block of memory that has been mapped. When you change an element you change it in that memory block, so any other objects pointing to that memory reflect the change. The memory address that is returned during each iteration is a copy of the pointer, not the matrix itself. – cdeterman Jul 15 '15 at 14:33
  • @cdeterman My mistake. I assumed big.matrix used some sort of out-of-memory approach to handle large data sets. Just looked over the docs and removed my answer to the first question. Thanks for the check – Steve Bronder Jul 15 '15 at 14:40
  • @Steve_Corrin, if using the `filebacked.big.matrix` option then it is out-of-memory. However, it is designed to still modify that matrix in-place (given that the `filebacked` object refers to a separate temp file). The idea is to create as little overhead as possible, prevent copies (unless explicit) and allow for objects larger than available RAM. – cdeterman Jul 15 '15 at 14:45
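
To make the in-place behaviour described in these last comments concrete, here is a minimal sketch of my own (not from the thread); the backing-file names are illustrative only:

    library(bigmemory)

    # Two variables bound to the same big.matrix see each other's writes,
    # because the underlying memory block is modified in place; only the R
    # object wrapping the pointer gets copied, which is why address() in the
    # question appears to change even though no data is copied.
    x <- big.matrix(nrow = 5, ncol = 1, type = "integer", init = 0L)
    y <- x             # copies the pointer, not the data
    x[1, 1] <- 42L
    y[1, 1]            # 42 -- the shared block was updated

    # The file-backed variant is out-of-memory but still modified in place
    fb <- filebacked.big.matrix(nrow = 5, ncol = 1, type = "integer",
                                backingfile = "fb.bin",
                                descriptorfile = "fb.desc")
    fb[1, 1] <- 2L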