I have two very large data sets (50M rows, 130 columns each) which I can't compare using the base packages, so I have to use an ffdf. This is the first time I am working with the ff package. I am trying to compare two ffdf objects and then write the differences to two output files ("in_file1_not_in_file2", "in_file2_not_in_file1"). Here is an example:
# For easy reproduction; normally a CSV file
set.seed(1234)
data1 <- data.frame(row.names=1:10, var1=sample(c(TRUE,FALSE), 10, replace=TRUE), var2=sample(1:8, 10, replace=TRUE), var3=as.factor(sample(c('AAA','BBB','CCC'), 10, replace=TRUE)))
data2 <- data.frame(row.names=1:10, var1=sample(c(TRUE,FALSE), 10, replace=TRUE), var2=sample(1:10, 10, replace=TRUE), var3=as.factor(sample(c('AAA','BBB','CCC'), 10, replace=TRUE)))
# Convert to an ffdf (as.ffdf comes from the ff package)
library(ff)
ffdata1 <- as.ffdf(data1)
ffdata2 <- as.ffdf(data2)
So now I am stuck. Normally I would paste all columns of each row into a single key column and then compare the keys of the two data sets with each other. Something like this:
# Normally - Combined columns
# paste() column-wise; apply() would go through as.matrix(), which pads numbers to a common width
data1$CCID <- do.call(paste, c(data1, sep='.'))
data2$CCID <- do.call(paste, c(data2, sep='.'))
# Combine columns of ffdf?
ffdata1$CCID <- ??
ffdata2$CCID <- ??
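# Normally - Comparison (cdata3[i, j] is TRUE when row i of data1 equals row j of data2)
cdata3 <- sapply(data2$CCID, FUN=function(x) { x == data1$CCID })
# Rows that have at least one match in the other data set; use == 0 for the differences
output1 <- data2[colSums(cdata3)>0,]
output2 <- data1[rowSums(cdata3)>0,]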
# Comparison of ffdf?
ffcdata3 <- ??
ffoutput1 <- ??
ffoutput2 <- ??
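To make the expected result concrete, here is a minimal base R sketch on small in-memory data frames. It does not use ff, so it only illustrates the output I am after, not a solution for 50M rows. The toy data and the names key1 and key2 are illustrative assumptions; the keys are built with do.call(paste, ...) and membership is tested with %in% instead of building the full comparison matrix.

```r
# Small in-memory stand-ins for the two files (hypothetical data)
data1 <- data.frame(var1 = c(TRUE, FALSE, TRUE),
                    var2 = c(1L, 10L, 3L),
                    var3 = factor(c("AAA", "BBB", "CCC")))
data2 <- data.frame(var1 = c(TRUE, FALSE, FALSE),
                    var2 = c(1L, 10L, 3L),
                    var3 = factor(c("AAA", "BBB", "CCC")))

# One key per row; paste() works column-wise, so numbers are not padded
key1 <- do.call(paste, c(data1, sep = "."))
key2 <- do.call(paste, c(data2, sep = "."))

# Rows whose key does not occur in the other data set
in_file1_not_in_file2 <- data1[!(key1 %in% key2), , drop = FALSE]
in_file2_not_in_file1 <- data2[!(key2 %in% key1), , drop = FALSE]
```

In this toy example only the third row differs between the two data frames, so each result holds that one row.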
I hope this is understandable, and sorry that I have no idea yet how to work with ffdf objects.