2

I'm trying to solve the problem I posted here using the data.table package, or other solutions that deal efficiently with big data (14-22 million rows). Any hints on how to speed this solution up or find a quicker workaround?

Thanks a lot for your help!

Kasia Kulma
  • Your best bet is parallelization. Unfortunately, it'll take some work. Take a look at `parallel`, `doParallel`, and `foreach` with `%dopar%`. The main idea is to split your data frame by user_id and send each group to a parallel worker (a rough sketch of this idea follows after these comments). – CPak Jul 03 '17 at 14:24
  • @Sotos, can you clarify what error message you're getting and what part of the code it refers to? I'm not getting any errors my end – Kasia Kulma Jul 03 '17 at 14:24
  • nvm, I missed a char in pasting :) – Sotos Jul 03 '17 at 14:26
  • @ChiPak, thanks for that, I want to think about `doParallel` as a last resort solution... – Kasia Kulma Jul 03 '17 at 14:29
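Following up on the parallelization suggestion in the comment above, here is a rough, untested sketch of that idea: split the table by user_id and let each worker count the unique covered days for its chunk. The helper name `count_days` and the worker count are illustrative choices, not from the original post.

library(data.table)
library(doParallel)   # also attaches parallel and foreach

# count unique covered days for one chunk of rows (user_id, start_date, end_date)
count_days <- function(chunk) {
  chunk[, .(n = uniqueN(unlist(Map(seq, as.integer(start_date),
                                        as.integer(end_date))))),
        by = user_id]
}

setDT(data)
groups <- split(data, by = "user_id")        # one chunk per user_id
cl <- makeCluster(4)                         # adjust to the available cores
registerDoParallel(cl)
res <- foreach(g = groups, .combine = rbind,
               .packages = "data.table") %dopar% count_days(g)
stopCluster(cl)

In practice one chunk per user_id would create far too many tiny tasks for 14-22 million rows, so the groups would need to be batched, but the overall structure stays the same.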

3 Answers

5
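The code below assumes the `data` object from the linked question, with columns user_id, start_date and end_date (as Dates). If you want to follow along without that post, a small stand-in with values loosely based on the printout in step 1 could look like this (it is only a subset, so the exact counts and row numbers will differ):

library(data.table)
data <- data.table(
  user_id    = c(12, 12, 12, 33, 33),
  start_date = as.Date(c("2010-10-31", "2010-12-18", "2011-10-31",
                         "1992-07-01", "2016-08-26")),
  end_date   = as.Date(c("2011-10-31", "2011-12-18", "2014-04-28",
                         "2016-07-01", "2017-01-26")))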

1) Let's multiply the data:

d <- replicate(1e2, data, simplify = F)
d <- rbindlist(d, use.names = T, fill = T, idcol = T)
d[, user_id := paste0(user_id, .id)]
      .id user_id start_date   end_date
   1:   1     121 2010-10-31 2011-10-31
   2:   1     121 2010-12-18 2011-12-18
   3:   1     121 2011-10-31 2014-04-28
   4:   1     121 2011-12-18 2014-12-18
   5:   1     121 2014-03-27 2015-03-27
  ---                                  
1296: 100   33100 1992-07-01 2016-07-01
1297: 100   33100 1993-08-20 2016-08-16
1298: 100   33100 1999-10-28 2012-11-15
1299: 100   33100 2006-01-31 2006-02-28
1300: 100   33100 2016-08-26 2017-01-26

2) Write the function from the previous post:

library(dplyr)  # the original row-wise approach uses dplyr verbs

yourFunction <- function(data){
  data %>%
    rowwise() %>%
    do(data_frame(user_id = .$user_id, 
                  Date = seq(.$start_date, .$end_date, by = 1))) %>%
    distinct() %>%
    ungroup() %>%
    count(user_id)
}

rez1 <- yourFunction(d)
rez1
    # A tibble: 200 x 2
   user_id     n
     <chr> <int>
 1     121  2606
 2    1210  2606
 3   12100  2606
 4    1211  2606
 5    1212  2606
 6    1213  2606
 7    1214  2606
 8    1215  2606
 9    1216  2606
10    1217  2606
# ... with 190 more rows

3) My data.table approach:

myFunction <- function(data){
  setDT(data)
  # vectorised seq(): build one vector of days per row
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  data[, n := seq2(start_date, end_date)]         # list column of per-row day sequences
  d <- data[, .(day = unlist(n)), by = user_id]   # one row per user/day
  d[, .(n = uniqueN(day)), by = user_id]          # count distinct days per user
}
rez2 <- myFunction(d)

4) Test whether the results are equal:

setDT(rez1)
setorder(rez1, user_id)
setorder(rez2, user_id)
all.equal(rez1, rez2)
[1] TRUE

5) BENCHMARKS:

cols <- c("test", "replications", "elapsed", "relative")
rbenchmark::benchmark(yourFunction(d),
                      myFunction(d), replications = 1, columns = cols)
             test replications elapsed relative
1 yourFunction(d)            1   10.23   42.625
2   myFunction(d)            1    0.24    1.000

6) Let's try with bigger data:

d <- replicate(1e5, data, simplify = F)
d <- rbindlist(d, use.names = T, fill = T, idcol = T)
d[, .N]
[1] 1300000
d[, user_id := paste0(user_id, .id)]

system.time(rez3 <- myFunction(d))

It still has not finished running....

UPDATE:

7) We can get a great increase in speed if we first convert the dates to integers. My approach no. 2:

myFunction2 <- function(data){
  setDT(data)
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  startD <- as.integer(data[["start_date"]])   # integer day numbers are much faster to sequence
  endD <- as.integer(data[["end_date"]])
  seqences <- seq2(startD, endD)
  data[, n := seqences]
  d <- data[, .(day = unlist(n)), by = user_id]
  d[, .(n = uniqueN(day)), by = user_id]
}

8) Now we can compare this to my first function using bigger data than before:

d <- replicate(1e4, data, simplify = F)
d <- rbindlist(d, use.names = T, fill = T, idcol = T)
d[, user_id := paste0(user_id, .id)]
d[, .N]
[1] 130000
### BENCHMARK
                    test replications elapsed relative
2  rez1 <- myFunction(d)            1   91.19    7.657
1 rez2 <- myFunction2(d)            1   11.91    1.000
all.equal(rez1, rez2)
[1] TRUE

UPDATE2:

9) It was a mistake to do the `unlist` and `uniqueN` steps separately; if we combine them in a single data.table call, we reduce memory usage and increase speed by roughly 3-4 times:

myFunction3 <- function(data){
  setDT(data)
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  startD <- as.integer(data[["start_date"]])
  endD <- as.integer(data[["end_date"]])
  seqences <- seq2(startD, endD)
  data[, n := seqences]
  data[, .(n = uniqueN(unlist(n))), by = user_id]
}

rbenchmark::benchmark(rez2 <- myFunction2(d),
                      rez1 <- myFunction3(d), replications = 1, columns = cols)
                    test replications elapsed relative
2 rez1 <- myFunction3(d)            1    4.19    1.000
1 rez2 <- myFunction2(d)            1   14.06    3.356

10) With this last approach I can process 1.3 million rows in ~25 seconds; before this change it was about 0.78 million rows in ~1 minute (depending on memory).

11) Original vs. last approach (on 1300 rows):

             test replications elapsed relative
1 yourFunction(d)            1   10.22  340.667
2  myFunction3(d)            1    0.03    1.000

UPDATE3:

12) Maybe this function can increase speed a bit:

myFunction5 <- function(d){
  setDT(d)
  setkey(d, user_id)
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  startD <- as.integer(d[["start_date"]])
  endD <- as.integer(d[["end_date"]])
  seqences <- seq2(startD, endD)            # list of integer day sequences, one per row
  dd <- d[, .(list(.I)), by = user_id]      # row indices belonging to each user
  indlist <- dd[[2]]
  mf <- function(x) uniqueN(unlist(x))      # count distinct days across a set of sequences
  ff <- function(x) mf(seqences[x])
  ff2 <- Vectorize(ff, "x")
  r <- ff2(indlist)
  data.table(user_id = dd[[1]], n = r, key = "user_id")
}
             test replications elapsed relative
1  myFunction3(d)            1    3.71     1.22
2 myFunction4(d1)            1    3.04     1.00
minem
  • Great answer and testing, this is essentially what I was after, and cutting down processing time 100 times is a great start. May use it for smaller data sets or coupled with `doParallel` if I figure out how! Thanks! – Kasia Kulma Jul 03 '17 at 16:07
  • This is simply amazing, Martins! Thanks a lot for the extra work - it improved the original solution.. what, 4000 times? Great stuff – Kasia Kulma Jul 04 '17 at 07:50
  • @KasiaKulma Testing against the original solution I get approximately 388 times, using a small data set. The greatest advantage is that we can process bigger data. It is very useful to remember that using integers in calculations gives a great increase in speed (a small illustration follows after these comments). – minem Jul 04 '17 at 07:55
  • Martin, I managed to run your code on the new data, but I'm getting a bug in `myFunction3` and `myFunction2` around `seqences <- seq2(start_date, end_date)`. Any ideas what's causing it? The first function seems to be running OK – Kasia Kulma Jul 04 '17 at 11:08
  • @KasiaKulma I made an error at that line (`seqences <- seq2(start_date, end_date)`). It should be `seqences <- seq2(startD, endD)` in `myFunction2` and `myFunction3`. I will update the answer as soon as I have time. – minem Jul 04 '17 at 11:22
  • Including the `copy` inside of your function increases the timing of the function execution and probably favors the slower execution functions. For a proper measure, you should make the copies outside of the function. – lmo Jul 04 '17 at 11:24
  • @KasiaKulma I hope the timings won't change a lot. – minem Jul 04 '17 at 11:27
  • @lmo I did not see any increase in timings using `copy`. But, anyway, I will check once more... – minem Jul 04 '17 at 11:28
  • Martins, so far I couldn't execute functions 2 & 3; I'm getting an error when running `seqences <- seq2(start_date, end_date)`: `Error in FUN(X[[i]], ...) : object 'start_date' not found`.. ??? – Kasia Kulma Jul 04 '17 at 11:55
  • @KasiaKulma Look at my previous comments; I will fix my answer as soon as I get time – minem Jul 04 '17 at 11:57
  • sorry, missed it completely, no worries and thanks for your help! – Kasia Kulma Jul 04 '17 at 11:59
  • @KasiaKulma updated answer yet again. Does it work now? – minem Jul 04 '17 at 15:04
  • Thanks, Martins, I appreciate your help and lots of time spent here. I'll re-test the code tonight on ~2.7mln rows and will let you know how it went - but don't worry, I don't expect you to do any more work on it, I don't think the solution can get much better than `myFunction4` :) thanks a million! – Kasia Kulma Jul 04 '17 at 15:30
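To illustrate the Date-vs-integer remark from the comments above, a quick check (not from the original answer; timings will vary by machine) could be:

from_d <- as.Date("2010-01-01"); to_d <- as.Date("2015-01-01")
rbenchmark::benchmark(
  date_seq = seq(from_d, to_d, by = 1),                  # dispatches to seq.Date
  int_seq  = seq(as.integer(from_d), as.integer(to_d)),  # plain integer sequence
  replications = 1e4,
  columns = c("test", "replications", "elapsed", "relative"))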
2

If I understand your question correctly, which is to count the number of unique days for each ID, an alternative using Map to construct the date sequences is:

setDT(data)[, .(cnt=uniqueN(unlist(Map(seq, start_date, end_date, by="day")))), by=user_id]
   user_id  cnt
1:      12 2606
2:      33 8967
lmo
1

This method keeps seq outside the inner loop, but has the unfortunate consequence of being memory hungry, and so breaks down at around 1e5 rows. But depending on your number of user_ids and date-range entries, this might be faster:

d[, .SD
  ][, .(date=seq(from=min(start_date), to=max(end_date), by=1))
  ][d, .(user_id=i.user_id, start_date=i.start_date, end_date=i.end_date, date=x.date), on=.(date >= start_date, date <= end_date), allow.cartesian=T
  ][, unique(.SD, by=c('user_id', 'date'))
  ][, .N, user_id
  ][order(user_id)
  ]
Clayton Stanley