1

I have data of 2 million transactions that include userID, invoice nr, invoice date and items purchased. I would like to find out what is the average time interval in between purchases (by customer).

sample.data <- data.frame(userID = c("ID1", "ID2", "ID2", "ID2", "ID3","ID3","ID4"),
invoiceNr = c("INV01", "INV02","INV03", "INV04","INV05", "INV06", "INV07"),
invoiceDate = lubridate::ymd("2008-06-29", "2008-10-10", "2008-10-10","2008-06-12","2008-12-11","2008-03-15","2008-07-14"),
items = c(LETTERS[1:7]))

I tried to generate the intervals with a for loop that populated a new column.

sample.data$intervals <- NA

for (i in 2:nrow(sample.data) {
  # if ID matches ID in previous row, calculate difference between purchase dates
  ifelse(sample.data$userID[i] == sample.data$userID[i-1], 
         sample.data$intervals[i] <- as.numeric(difftime(sample.data$invoiceDate[i], sample.data$invoiceDate[i-1], units = "days")),
  # if the previous ID is different, then do not calculate the time difference, but mark as NA (this is the first purchase in customers history)
        sample.data$intervals[i] <- NA)
}

Form here I would be able to aggregate the data and calculate the overall mean or mean by userID.

However the for loop takes forever with such a large dataset. Is there a faster/better method?

2 Answers2

0

You can use dplyr:

library(dplyr)

sample.data %>% 
  group_by(userID) %>% 
  arrange(invoiceDate) %>% 
  mutate(timediff = c(NA, diff(invoiceDate))) %>% 
  summarise(mean_time_diff = mean(timediff, na.rm = TRUE))
#> # A tibble: 4 x 2
#>   userID mean_time_diff
#>   <chr>           <dbl>
#> 1 ID1               NaN
#> 2 ID2                60
#> 3 ID3               271
#> 4 ID4               NaN

Obviously, if the user has only made a single purchase, the average time between purchases is NA

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
0

data.table solution :

library(data.table)
sample.data[order(userID,invoiceDate),
            .(lastVisit=difftime(invoiceDate,lag(invoiceDate,),unit = "days"),
              nbVisit = .N),
            by=userID]

   userID lastVisit nbVisit
1:    ID1   NA days       1
2:    ID2   NA days       3
3:    ID2  120 days       3
4:    ID2    0 days       3
5:    ID3   NA days       2
6:    ID3  271 days       2
7:    ID4   NA days       1

you can also average this for each customer :

sample.data[order(userID,invoiceDate),
            .(lastVisit=difftime(invoiceDate,lag(invoiceDate,),unit = "days"),
              nbVisit = .N),
            by=userID][,.(avg=mean(lastVisit,na.rm=T)),by=.(userID,nbVisit)]
   userID nbVisit      avg
1:    ID1       1 NaN days
2:    ID2       3  60 days
3:    ID3       2 271 days
4:    ID4       1 NaN days
Waldi
  • 39,242
  • 6
  • 30
  • 78