Splitting an ffdf object

Question

I'm using the ff and ffbase packages to manage a big CSV file (~40 GB, 275e6 observations). I'd like to split/partition this file according to one of its columns (a factor column).

With a normal data frame, I would do something like this:

a <- data.frame(rnorm(10000, 0, 1),
                sample(1:100, 10000, replace = TRUE),
                sample(letters, 10000, replace = TRUE),
                stringsAsFactors = TRUE)  # keeps V3 a factor (no longer the default since R 4.0)
names(a) <- c('V1', 'V2', 'V3')
a_partition <- split(a, a$V3)
names(a_partition) <- paste("df", names(a_partition), sep = "_")
list2env(a_partition, globalenv())

But ff and ffbase don't have a split function. Looking through the ffbase documentation, I found ffdfdply and tried to use it as follows:

library(ff)
library(ffbase)
ffa <- as.ffdf(a)
ffa_partition <- ffdfdply(x = ffa, split = ffa$V3)

Alas, I get the following log output:

calculating split sizes
building up split locations
working on split 1/1, extracting data in RAM of 26 split elements,
totalling, 0.00015 GB, while max specified
data specified using BATCHBYTES is 0.01999 GB
... applying FUN to selected data
Error: argument "FUN" is missing, with no default

I tried FUN = as.data.frame (since the result of the function must be a data frame), with no luck: doing so makes ffa_partition a copy of ffa.

How can I partition my ffdf?

Hi, if you look at the help for `ffdfdply`, you will see there is a third mandatory argument, `FUN`, that is missing from your call, hence the error message. I would try FUN=as.data.frame - Eric Lecoutre, May 20 '16 at 12:53
@Eric Lecoutre: yes, I tried that, but it does nothing (literally, it returns the ffdf I passed in). - G. Lombardo, May 22 '16 at 10:05

score 1 · Accepted Answer · answered Oct 29 '18 at 18:59

Two years late, but I believe this does what you needed:

# Build a named list holding one ffdf per value of V3
result_list <- list()
for (letter in letters) {
    result_list[[letter]] <- subset(ffa, V3 == letter)
}
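
A variation on the loop above, as a sketch rather than a tested recipe for the 40 GB case: it iterates over the levels stored in the ff factor instead of the hard-coded `letters`, so it should carry over to any factor column. The toy data mirrors the question's example; `V3` is built with `factor()` explicitly, on the assumption that ff cannot hold character columns. The loop is kept at top level, as in the code above, because `subset` on an ffdf evaluates its condition non-standardly.

```r
library(ff)
library(ffbase)

# Toy data standing in for the real file (same shape as in the question).
set.seed(1)
a <- data.frame(V1 = rnorm(10000, 0, 1),
                V2 = sample(1:100, 10000, replace = TRUE),
                V3 = factor(sample(letters, 10000, replace = TRUE)))
ffa <- as.ffdf(a)

# One subset per level of the factor column, stored in a named list.
result_list <- list()
for (lev in levels(ffa$V3)) {
    result_list[[lev]] <- subset(ffa, V3 == lev)
}

# Sanity check: the pieces together should account for every row.
rows_per_piece <- sapply(result_list, nrow)
total_rows <- sum(rows_per_piece)
```

Each pass scans the whole ffdf once, so the cost grows with the number of levels. This sketch does not handle a level that matches no rows; such levels may need to be skipped. As for `ffdfdply`, it is documented as a split-apply-combine function: the results of `FUN` are bound back into a single ffdf, which would explain why `FUN = as.data.frame` returns all of the original rows.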

pedrostrusso
