How to specify colClasses when reading a very big csv file into R using read.table.ffdf?

Question

I am trying to read a very big .csv file, around 20 GB in size, using the function read.table.ffdf() in the "ff" package, but I am having trouble specifying the colClasses option that is passed on to read.csv().

I have to specify the colClasses option because some columns in the file are labels stored as very long integers, e.g. with 11 digits. For example, two rows in the file are

86246,205,17,1719,104116343,8435,2013-03-13,12,OZ,1,2.59
86246,205,17,1719,10800749282,8435,2013-03-13,12,OZ,1,2.59

The value 10800749282 is too large for the type "integer" and can only be handled as either "numeric" or "character". But the value 104116343 in the row above it is small enough to fit, so R by default treats this column as "integer".

I tried the following but got an error. Does anyone know how to solve this problem? Highly appreciated!

dat <- read.table.ffdf(file="file.csv", FUN = "read.csv", na.strings = "", colClasses="character")

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented

That value is not larger than the upper limit for "long integers". That upper limit is set by the 53-bit mantissa of the numeric class: `10800749282L < 2^53-1` returns `[1] TRUE`. - IRTFM, Apr 25 '14 at 21:39
He's right, though, that the 'integer' type in R will not handle 10800749282L (`.Machine$integer.max` is 2147483647), so if read.table tries to read that column as `integer` instead of `numeric` because of the first row, it's trouble. What I'm not sure about, @user3574507: why not specify colClasses="numeric"? - sebkopf, Apr 25 '14 at 21:58

score 0 · Answer 1 · answered Jul 01 '14 at 10:21

As your error suggests, there is no 'character' data type implemented in ff. Character columns should be read as factors instead. Assuming your file contains x columns, the following works:

dat <- read.csv.ffdf(NULL, file="file.csv", na.strings = "", colClasses=rep("factor", x))
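If the column count x is not known in advance, one way to get it is to read just the first line with base R. This is a minimal sketch: the temporary file stands in for file.csv, and header = FALSE is an assumption based on the sample rows in the question, which show no header line.

```r
# Two sample rows from the question, written to a temporary file so the
# sketch runs on its own; in practice, point `path` at the real CSV.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "86246,205,17,1719,104116343,8435,2013-03-13,12,OZ,1,2.59",
  "86246,205,17,1719,10800749282,8435,2013-03-13,12,OZ,1,2.59"
), path)

# Read only the first line, as text, so counting the fields stays cheap
# even for a 20 GB file. header = FALSE is an assumption (see above).
first_row <- read.csv(path, header = FALSE, nrows = 1,
                      colClasses = "character")
x <- ncol(first_row)
x  # 11 for the sample rows
```

Reading with colClasses = "character" is fine here because this is an ordinary data frame, not an ffdf.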

However, you probably do not want to import all of the data as factors, as that is extremely inefficient. Import your numerical data as 'numeric' instead; that also covers the 11-digit labels, which do not fit in 'integer'. Assuming your first 5 columns are numeric and the remaining 3 are character columns:

dat <- read.csv.ffdf(NULL, file="file.csv", na.strings = "", colClasses=c(rep("numeric", 5), rep("factor", 3)))
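For the 11-column sample rows in the question, a colClasses vector might look like the sketch below. The assignment of types is an assumption read off those two rows (the date and the "OZ" column as factors, everything else as numeric), and it is checked here with base read.csv on a small temporary file rather than with ff itself. The same vector should be usable as the colClasses argument in the read.csv.ffdf call above, though that is not verified here.

```r
# Sample rows from the question in a temporary file (stand-in for file.csv).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "86246,205,17,1719,104116343,8435,2013-03-13,12,OZ,1,2.59",
  "86246,205,17,1719,10800749282,8435,2013-03-13,12,OZ,1,2.59"
), path)

# Assumed layout: columns 1-6 numeric, 7 (date) factor, 8 numeric,
# 9 (unit label) factor, 10-11 numeric.
col_classes <- c(rep("numeric", 6), "factor", "numeric", "factor",
                 rep("numeric", 2))

# header = FALSE is an assumption: the sample shows no header line,
# so read.csv names the columns V1..V11.
chunk <- read.csv(path, header = FALSE, colClasses = col_classes)

sapply(chunk, class)            # numeric or factor, never character
chunk$V5 > .Machine$integer.max # second value exceeds the integer range
chunk$V5[2] == 10800749282      # still stored exactly as a double
```

Because column 5 is declared numeric up front, its type no longer depends on whether the first rows happen to contain small values.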
