
I'm trying to read a large dataset (3.7 million rows, 180 columns) into R using the ff package. The dataset contains several data types - factor, logical, and numeric.

The problem is when reading in numeric variables. For example, one of my columns is:

TotalBeforeTax
126.9
88.0
124.5
90.9
...
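
The read call is essentially along these lines (a simplified sketch; the file name and the colClasses vector are placeholders for the real ones):

library(ff)

## Chunked CSV import with explicit column classes
## ("data.csv" and the uniform "numeric" classes are placeholders)
data <- read.csv.ffdf(file = "data.csv", header = TRUE,
                      colClasses = rep("numeric", 180),
                      next.rows = 50000)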

When I try reading the data in, the following error is thrown:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"126.90000"'

I tried declaring the class as integer (it was already declared as numeric) via the colClasses argument, but to no avail. I also tried changing it to "a real" (whatever that is supposed to mean); it starts reading in the data, but at some point throws:

Error in methods::as(data[[i]], colClasses[i]) : 
  no method or default for coercing “character” to “a real”

(My guess is that it comes across an NA and doesn't know what to do with it.)

The funny thing is, if I declare the column as a factor, everything reads in nicely.

What gives?

neuron
  • See also http://stackoverflow.com/questions/22357396/ff-in-r-no-applicable-method-for-recodelevels – Apr 04 '14 at 16:49

3 Answers


OK, so I managed to solve this using a primitive workaround. First, split the .csv file into manageable chunks (I used 50,000 rows each) with a CSV file splitter application. Then execute the following code:

## First, set the folder where the split .csv files are, and their common name stem.

sourceDir <- "split_files_folder"
sourceFile <- paste(sourceDir, "common_name_of_split_files", sep = "/")

## Now set the number of split pieces (as a number, not a string,
## since it feeds into 1:pieces below).

pieces <- 74  ## placeholder: e.g. 3.7M rows split into 50,000-row chunks

## Set the destination folder for the tab-delimited text file.
## Set the output file name.

destDir <- "destination_folder"
destFile <- paste(paste(destDir, "datafile", sep = "/"), "txt", sep = ".")

## Now loop over the pieces: write the header once, then append.

for (i in 1:pieces) {
  temp <- read.csv(file = paste(paste(sourceFile, i, sep = "_"), "csv", sep = "."))
  if (i == 1) {
    write.table(temp, file = destFile, quote = FALSE, sep = "\t",
                row.names = FALSE, col.names = TRUE)
  } else {
    write.table(temp, file = destFile, append = TRUE, quote = FALSE, sep = "\t",
                row.names = FALSE, col.names = FALSE)
  }
}

And voila! You've got a huge tab-delimited text file!
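
From there, read.delim.ffdf loads the combined file without any of the issues above (it handled the numeric and integer columns fine for me). A minimal sketch, reusing destFile from above and assuming 50,000-row chunks:

library(ff)

## Chunk-wise import of the combined tab-delimited file into an ffdf
bigdata <- read.delim.ffdf(file = destFile, header = TRUE,
                           first.rows = 50000, next.rows = 50000)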

neuron
  • Thanks for the answer, @neuron. To improve the speed of the loop, I would suggest using fread() from the data.table package instead of read.csv(). fread() is probably the fastest method for reading a dataset, as shown by these benchmarks: https://rpubs.com/dpastoor/benchmark-nm-read – rafa.pereira Aug 20 '15 at 13:58
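
A sketch of that substitution, reusing the placeholders from the loop above (fread() returns a data.table, which write.table() writes out the same way; the original if/else is folded into the append = and col.names = arguments):

library(data.table)

for (i in 1:pieces) {
  ## fread() is typically much faster than read.csv() on large files
  temp <- fread(paste(paste(sourceFile, i, sep = "_"), "csv", sep = "."))
  write.table(temp, file = destFile, append = (i > 1), quote = FALSE,
              sep = "\t", row.names = FALSE, col.names = (i == 1))
}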

Solution 1

You could try laf_to_ffdf from the ffbase package. Something like:

library(LaF)
library(ffbase)

con <- laf_open_csv("yourcsvfile.csv",
  column_names = column_names,  # a character vector with the column names
  column_types = column_types,  # a character vector with the column types (cf. colClasses)
  dec = ".", sep = ",", skip = 1)

ffdf <- laf_to_ffdf(con)

Or if you want to detect the types automatically:

library(LaF)
library(ffbase)

m <- detect_dm_csv("yourcsvfile.csv")
con <- laf_open(m)
ffdf <- laf_to_ffdf(con)

Solution 2

Use a column class of character for the offending column and cast it to numeric in the transFUN argument of read.csv.ffdf:

ffdf <- read.csv.ffdf([your regular arguments], transFUN = function(d) {
  d$offendingcolumn <- as.numeric(d$offendingcolumn)
  d
})
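
One caveat: if the offending column arrives as a factor rather than character, as.numeric() on a factor returns the internal level codes, not the printed values, so convert via as.character() first. A sketch of that variant:

ffdf <- read.csv.ffdf([your regular arguments], transFUN = function(d) {
  ## as.numeric() on a factor yields level codes; go through character first
  d$offendingcolumn <- as.numeric(as.character(d$offendingcolumn))
  d
})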
Jan van der Laan
  • I tried solution 2. Unfortunately, the read function doesn't support character columns (you can check with `.vimplemented`) and throws an error. I tried loading them as factors and converting them back to numerics with `transFUN`, but that gives the wrong values. – neuron Apr 07 '14 at 09:31
  • So, I managed to solve it using a primitive workaround. I used a csv splitter application to break up the file into manageable chunks of 50,000 rows each. Then I wrote an R script that would load a chunk and export it as a tab-delimited text file, then load the next chunk, export it and append the output to the already generated text file, and so on. The `read.delim.ffdf` function didn't cause any issues when loading numerical or integer values. – neuron Apr 08 '14 at 10:42
  • @ssantic Too bad the second solution didn't work. That probably has to do with the fact that `read.csv.ffdf` doesn't like it when the colClasses change. And the first (possible) solution? – Jan van der Laan Apr 09 '14 at 06:37
  • @ssantic If you have a working solution, I would add it as an answer. – Jan van der Laan Apr 09 '14 at 06:38
  • Honestly, the first one took forever to load the file (and there is no `VERBOSE` argument, so I stopped it at some point). I'll post my solution as an answer. – neuron Apr 09 '14 at 06:51

The problem seems to be that the number 126.90000 is surrounded by quotation marks ("). So maybe you should first read the variable in as character, then remove all unwanted characters, and finally convert the variable to numeric.
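
For example (a minimal sketch of that idea on a plain character vector, assuming the quotes come through as part of the values):

x <- c("\"126.90000\"", "\"88.00000\"", NA)
x <- gsub("\"", "", x)  ## strip the embedded quotes
as.numeric(x)           ## 126.9 88 NA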

SeDur
  • I thought that as well, but when I use the plain `read.csv` function to read in, say, the first several thousand lines, it works like a charm. Plus, I'm not sure I can change the types of columns in an `ff` data frame the same way I can in a regular one. – neuron Apr 04 '14 at 11:07
  • @ssantic There was a discussion on this problem some time ago on the r-devel list: https://stat.ethz.ch/pipermail/r-devel/2013-September/067605.html. Not that a solution is given there... – Jan van der Laan Apr 04 '14 at 12:26