
I have a very large .csv file (~4GB) which I'd like to read, then subset.

The problem comes at the reading stage (memory allocation error). Because the file is that large, reading it crashes, so what I'd like is a way to subset the file before or while reading it, so that only the rows for one city (Cambridge) are loaded.

f:
        id City      Value
        1  London     17
        2  Coventry   21
        3  Cambridge  14
        ......

I've already tried the usual approaches:

# read the first rows only, to determine the column classes
f <- read.csv("f.csv", stringsAsFactors = FALSE, header = TRUE, nrows = 100)

f.colclass <- sapply(f, class)

# then re-read with the column classes fixed in advance
f <- read.csv("f.csv", sep = ",", nrows = 3000000, stringsAsFactors = FALSE,
              header = TRUE, colClasses = f.colclass)

which seem to work for up to 1-2M rows, but not for the whole file.

I've also tried subsetting during the read itself, using pipe():

f <- read.table(pipe('grep "Cambridge" f.csv'), sep = ",",
                colClasses = f.colclass, stringsAsFactors = FALSE)

and this also seems to crash.

I thought the sqldf or data.table packages would have something for this, but no success yet!

Thanks in advance, p.

user3310782
  • http://stackoverflow.com/questions/26861951/using-read-csv-sql-to-select-multiple-values-from-a-single-column might help – user20650 Nov 04 '15 at 12:33
  • 2
    Try fread in the data.table package which tends to use less memory than other approaches. Also read.csv.sql in the sqldf package reads the file into an sqlite database (which it creates for you) without gonig through R so that R's limitations do not apply and then pulls into R only that part that you specify. As long as the final result that you pull into R is sufficiently small then it can work. – G. Grothendieck Nov 04 '15 at 14:17
  • I will try fread today and let people know the outcome. Thanks! – user3310782 Nov 05 '15 at 09:59
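
For reference, a minimal sketch of the fread route suggested in the comments above. It assumes the file is named f.csv with columns id, City and Value as in the sample data; the shell-filter variant also assumes grep is available on your system:

library(data.table)

# fread is faster and more memory-frugal than read.csv, but this
# still loads the entire file before subsetting:
dt <- fread("f.csv")
cambridge <- dt[City == "Cambridge"]

# alternatively, filter before the data ever reaches R by letting
# fread read from a shell command; grep drops the header row, so
# the column names are supplied explicitly (note this matches
# "Cambridge" anywhere in the line, not only in the City column)
cambridge <- fread(cmd = 'grep "Cambridge" f.csv',
                   col.names = c("id", "City", "Value"))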

3 Answers


I think this was alluded to already, but just in case it wasn't completely clear: the sqldf package creates a temporary SQLite DB on your machine from the csv file and allows you to write SQL queries that subset the data before saving the results to a data.frame.

library(sqldf)

query_string <- "select * from file where City == 'Cambridge'"
f <- read.csv.sql(file = "f.csv", sql = query_string)

# or, rather than saving all of the raw data in f, you may want to perform a sum
f_sum <- read.csv.sql(file = "f.csv",
                      sql = "select sum(Value) from file where City == 'Cambridge'")
Joe R

One solution to this type of error is:

  1. Convert your csv file to an Excel file first.
  2. Then map the Excel file into a MySQL table using Toad for MySQL; it's easy, just check the data types of the variables.
  3. Then, using the RODBC package, you can access such a large dataset from R (a sketch follows below).

I am working with datasets of more than 20 GB this way.
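
As a rough illustration of step 3 only, here is a minimal sketch. It assumes the data ended up in a MySQL table named f, reachable through an ODBC DSN called big_csv_db; the DSN, user and password are placeholders, not part of the original answer:

library(RODBC)

# connect through an ODBC data source name (DSN) pointing at the
# MySQL database; DSN name, user and password are placeholders
ch <- odbcConnect("big_csv_db", uid = "user", pwd = "password")

# let the database do the filtering, so R only ever sees the subset
cambridge <- sqlQuery(ch, "SELECT * FROM f WHERE City = 'Cambridge'")

odbcClose(ch)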

Pankaj Sharma
  • Assuming this works, the only problem is: how do you convert .csv -> Excel? Excel cannot open the file either! Thanks – user3310782 Nov 04 '15 at 12:44
  • Use any converter, e.g. http://www.zamzar.com/convert/csv-to-xls/ ; there are many available. Or you can load the csv file directly into Toad for MySQL. – Pankaj Sharma Nov 04 '15 at 12:46
  • I've downloaded Toad for MySQL, and of course MySQL too. I'm starting to explore it. Still a bit surprised that there are no methods to filter content before/while reading w/o having to use big weaponry! Thanks in any case – user3310782 Nov 05 '15 at 09:29
  • user3310782, did you have a look at `read.csv.sql` in the link above, mentioned by Gabor (the package author)? It allows only the subsetted data to be read into R. – user20650 Nov 05 '15 at 23:27

Although there's nothing wrong with the existing answers, they miss the most conventional/common way of dealing with this: chunks (Here's an example from one of the multitude of similar questions/answers).

The only difference is that, unlike most of the answers which load the whole file, you read it chunk by chunk and only keep the subset you need at each iteration.

# open connection to file (mostly convenience)
file_location = "C:/users/[insert here]/..."
file_name = 'name_of_file_i_wish_to_read.csv'
con <- file(paste(file_location, file_name,sep='/'), "r")

# set chunk size - basically want to make sure its small enough that
# your RAM can handle it
chunk_size = 1000 # the larger the chunk the more RAM it'll take but the faster it'll go    

i = 0 # set i to 0 as it'll increase as we loop through the chunks

# loop through the chunks and select rows that contain cambridge
repeat {
  # things to do only on the first read-through 
  if(i==0){
    # read in columns only on the first go
    grab_header=TRUE
    # load the chunk
    tmp_chunk = read.csv(con, nrows = chunk_size,header=grab_header)
    
    # subset only to desired criteria
    cond = tmp_chunk[,'City'] == "Cambridge"
    
    # initiate container for desired data
    df = tmp_chunk[cond,] # save desired subset in initial container
    cols = colnames(df) # save column names to re-use on next chunks
  }

  # things to do on all subsequent non-first chunks
  else if(i>0){
    grab_header=FALSE
    # read the next chunk; when the connection is exhausted read.csv
    # throws an error rather than returning 0 rows, so catch it and
    # return an empty data frame instead
    tmp_chunk = tryCatch(
      read.csv(con, nrows = chunk_size, header = grab_header, col.names = cols),
      error = function(e) data.frame())

    # set stopping criteria for the loop
    # when there are no rows left to read, exit the loop
    if(nrow(tmp_chunk)==0){break}
    
    # subset only to desired criteria
    cond = tmp_chunk[,'City'] == "Cambridge"        
    
    # append to existing dataframe
    df = rbind(df, tmp_chunk[cond,])
  }
  # add 1 to i to avoid the things needed to do on the first read-in
  i=i+1
}

close(con) # close connection

# check out the results
head(df)
Gene Burinsky