Compare two ffdf

Question

I have two very large data sets (50M rows, 130 columns) that I can't compare with the basic packages, so I have to use ffdf objects. This is the first time I am working with the ff package. I am trying to compare two ffdf objects and then write the differences to two output files ("in_file1_not_in_file2" and "in_file2_not_in_file1"). Here is an example:

# For easy reproduction; normally a CSV file
set.seed(1234)
data1 <- data.frame(row.names=1:10, var1=sample(c(TRUE,FALSE), 10, replace=TRUE), var2=sample(1:8, 10, replace=TRUE), var3=as.factor(sample(c('AAA','BBB','CCC'), 10, replace=TRUE)))
data2 <- data.frame(row.names=1:10, var1=sample(c(TRUE,FALSE), 10, replace=TRUE), var2=sample(1:10, 10, replace=TRUE), var3=as.factor(sample(c('AAA','BBB','CCC'), 10, replace=TRUE)))

# Convert to an ffdf (as.ffdf() comes from the ff package)
library(ff)
ffdata1 <- as.ffdf(data1)
ffdata2 <- as.ffdf(data2)

So now I am stuck. Normally I would combine all columns of each row into one string and compare these strings with each other. Something like this:

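# Normally - Combined columns
# paste() is called on the columns directly: apply() first coerces the data
# frame with as.matrix(), which can pad numbers to a common width ("10" vs " 3")
data1$CCID <- do.call(paste, c(data1, sep='.'))
data2$CCID <- do.call(paste, c(data2, sep='.'))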

# Combine columns of ffdf?
ffdata1$CCID <- ??
ffdata2$CCID <- ??

# Normally - Comparison
cdata3 <- sapply(data2$CCID, FUN=function(x) { x == data1$CCID })
output1 <- data2[colSums(cdata3)>0,]
output2 <- data1[rowSums(cdata3)>0,]

# Comparison of ffdf?
ffcdata3 <- ??
ffoutput1 <- ??
ffoutput2 <- ??

I hope this is understandable. Sorry that I have no idea how to work with ffdf.

Does the order matter or are you just checking to see if there are overlapping combinations in your CCID columns? I.e., does row 1 need to match with row 1, or can row 1 have a match with row 5? — Andrew, Feb 13 '19 at 14:38
Looks like `ffapply()` is a good place to start if you want to keep things in the ff package. Otherwise you could look into collapsing __# of rows iteratively so it reduces the burden on memory. — Andrew, Feb 13 '19 at 15:28
The order doesn't matter. So row 1 in file1 could match row 5 in file2. I have played a bit with `ffapply()` but I couldn't get it to work. — Marvelous, Feb 13 '19 at 17:25
I am sorry I am not much more help. Because I am unfamiliar with ff, the only thing I could think to do would be to iterate `do.call(paste0, data1)` for chunks of rows and `c()` the unique combinations together into one string. Then use `%in%` to figure out what values are in data1 that are not in data2 and vice versa — Andrew, Feb 14 '19 at 14:21
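
The chunk-wise idea from the last comment could look roughly like the sketch below. It is not a tested solution for 50M rows, only an illustration on the small example data. The helper names `make_keys`, `unique_keys` and `rows_not_in` are made up for this sketch. It assumes that the unique keys of one file fit in memory, and that `chunk()` on an ffdf yields row ranges that can be used to pull one block at a time into RAM as a regular data.frame.

```r
library(ff)

# Example data from the question
set.seed(1234)
data1 <- data.frame(row.names=1:10, var1=sample(c(TRUE,FALSE), 10, replace=TRUE), var2=sample(1:8, 10, replace=TRUE), var3=as.factor(sample(c('AAA','BBB','CCC'), 10, replace=TRUE)))
data2 <- data.frame(row.names=1:10, var1=sample(c(TRUE,FALSE), 10, replace=TRUE), var2=sample(1:10, 10, replace=TRUE), var3=as.factor(sample(c('AAA','BBB','CCC'), 10, replace=TRUE)))
ffdata1 <- as.ffdf(data1)
ffdata2 <- as.ffdf(data2)

# Hypothetical helper: one key string per row of an in-memory data.frame block.
# unname() keeps a column called "sep" or "collapse" from being read as an argument.
make_keys <- function(block) {
  do.call(paste, c(unname(as.list(block)), sep = "."))
}

# Hypothetical helper: collect the unique keys of an ffdf, one chunk at a time
unique_keys <- function(x) {
  keys <- character(0)
  for (i in chunk(x)) {
    block <- x[i, , drop = FALSE]   # this chunk as a regular data.frame
    keys <- unique(c(keys, make_keys(block)))
  }
  keys
}

# Hypothetical helper: rows of x whose key does not occur in other_keys
rows_not_in <- function(x, other_keys) {
  parts <- list()
  for (i in chunk(x)) {
    block <- x[i, , drop = FALSE]
    keep <- !(make_keys(block) %in% other_keys)
    parts[[length(parts) + 1L]] <- block[keep, , drop = FALSE]
  }
  do.call(rbind, parts)
}

in_file1_not_in_file2 <- rows_not_in(ffdata1, unique_keys(ffdata2))
in_file2_not_in_file1 <- rows_not_in(ffdata2, unique_keys(ffdata1))
```

For the real data, the `rbind()` at the end of `rows_not_in` would probably have to be replaced by writing each filtered block straight to the output file, for example with `write.table(..., append = TRUE)`, so that the differences never have to sit in memory all at once.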
