I have two very large data sets (50M rows, 130 columns each) that I can't compare with the basic packages because they don't fit into memory, so I have to use an ffdf. This is the first time I am working with the ff package. I am trying to compare two ffdf objects and then write the differences to two output files ("in_file1_not_in_file2" and "in_file2_not_in_file1"). Here is an example:

# For easy reproduction; normally a CSV file
set.seed(1234)
data1 <- data.frame(row.names=1:10, var1=sample(c(TRUE,FALSE), 10, replace=TRUE), var2=sample(1:8, 10, replace=TRUE), var3=as.factor(sample(c('AAA','BBB','CCC'), 10, replace=TRUE)))
data2 <- data.frame(row.names=1:10, var1=sample(c(TRUE,FALSE), 10, replace=TRUE), var2=sample(1:10, 10, replace=TRUE), var3=as.factor(sample(c('AAA','BBB','CCC'), 10, replace=TRUE)))

# Convert to an ffdf (requires the ff package)
library(ff)
ffdata1 <- as.ffdf(data1)
ffdata2 <- as.ffdf(data2)

Now I am stuck. Normally I would paste all columns of each row together into a single key column and compare the key columns of the two data sets with each other. Something like this:

# Normally - Combined columns
data1$CCID <- apply(data1, 1, paste, collapse='.')
data2$CCID <- apply(data2, 1, paste, collapse='.')

# Combine columns of ffdf?
ffdata1$CCID <- ??
ffdata2$CCID <- ??

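# Normally - Comparison
# cdata3[i, j] is TRUE when row i of data1 has the same key as row j of data2
cdata3 <- sapply(data2$CCID, FUN=function(x) { x == data1$CCID })
output1 <- data2[colSums(cdata3)>0,]  # rows of data2 whose key also occurs in data1
output2 <- data1[rowSums(cdata3)>0,]  # rows of data1 whose key also occurs in data2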

# Comparison of ffdf?
ffcdata3 <- ??
ffoutput1 <- ??
ffoutput2 <- ??

I hope this is understandable. Sorry that I have no idea how to work with an ffdf.

adibender
Marvelous
  • Does the order matter or are you just checking to see if there are overlapping combinations in your CCID columns? I.e., does row 1 need to match with row 1, or can row 1 have a match with row 5? – Andrew Feb 13 '19 at 14:38
  • Looks like `ffapply()` is a good place to start if you want to keep things in the ff package. Otherwise you could look into collapsing a fixed number of rows at a time, iteratively, to reduce the burden on memory. – Andrew Feb 13 '19 at 15:28
  • The order doesn't matter, so row 1 in file1 could match row 5 in file2. I have played a bit with `ffapply()` but I couldn't get it to work. – Marvelous Feb 13 '19 at 17:25
  • I am sorry I am not much more help. Because I am unfamiliar with ff, the only thing I could think to do would be to iterate `do.call(paste0, data1)` for chunks of rows and `c()` the unique combinations together into one string. Then use `%in%` to figure out what values are in data1 that are not in data2 and vice versa – Andrew Feb 14 '19 at 14:21
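
The chunked idea from the last comment could look roughly like the sketch below. It is only an illustration in base R, not a tested ff solution: `row_keys()` is a hypothetical helper, the small data frames are made up, and the assumption is that an ffdf can be passed in the same way because indexing it as `x[i, ]` returns an ordinary data.frame for the requested rows. It also assumes that the key vectors themselves fit in memory and that no value contains the separator.

```r
# Hypothetical helper: build one key per row by pasting all columns together,
# processing `chunk_size` rows at a time to limit peak memory use.
# Assumption: `x` is a data.frame, or an ffdf for which x[i, ] yields a data.frame.
row_keys <- function(x, chunk_size = 2L, sep = ".") {
  n <- nrow(x)
  if (n == 0L) return(character(0))
  keys <- character(n)
  for (start in seq(1L, n, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1L, n)
    block <- as.data.frame(x[idx, , drop = FALSE])
    # unname() avoids clashes between column names and paste() arguments
    keys[idx] <- do.call(paste, c(unname(as.list(block)), sep = sep))
  }
  keys
}

# Made-up example data with the same column types as in the question
data1 <- data.frame(var1 = c(TRUE, FALSE, TRUE, TRUE, FALSE),
                    var2 = c(1L, 2L, 3L, 4L, 5L),
                    var3 = factor(c("AAA", "BBB", "CCC", "AAA", "BBB")))
data2 <- data.frame(var1 = c(FALSE, TRUE, TRUE, FALSE, TRUE),
                    var2 = c(2L, 3L, 9L, 5L, 1L),
                    var3 = factor(c("BBB", "CCC", "AAA", "CCC", "AAA")))

keys1 <- row_keys(data1)
keys2 <- row_keys(data2)

# Row order does not matter: a row counts as matched if its key occurs anywhere
in_file1_not_in_file2 <- data1[!(keys1 %in% keys2), , drop = FALSE]
in_file2_not_in_file1 <- data2[!(keys2 %in% keys1), , drop = FALSE]
```

Using `%in%` instead of the full `sapply()` comparison avoids building an n-by-n logical matrix, which would be far too large for 50M rows. The two results could then be written out with `write.csv()`, for example.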
