I have data of 2 million transactions that include userID, invoice nr, invoice date and items purchased. I would like to find out what is the average time interval in between purchases (by customer).
sample.data <- data.frame(userID = c("ID1", "ID2", "ID2", "ID2", "ID3","ID3","ID4"),
invoiceNr = c("INV01", "INV02","INV03", "INV04","INV05", "INV06", "INV07"),
invoiceDate = lubridate::ymd("2008-06-29", "2008-10-10", "2008-10-10","2008-06-12","2008-12-11","2008-03-15","2008-07-14"),
items = c(LETTERS[1:7]))
I tried to generate the intervals with a for loop that populated a new column.
sample.data$intervals <- NA
for (i in 2:nrow(sample.data) {
# if ID matches ID in previous row, calculate difference between purchase dates
ifelse(sample.data$userID[i] == sample.data$userID[i-1],
sample.data$intervals[i] <- as.numeric(difftime(sample.data$invoiceDate[i], sample.data$invoiceDate[i-1], units = "days")),
# if the previous ID is different, then do not calculate the time difference, but mark as NA (this is the first purchase in customers history)
sample.data$intervals[i] <- NA)
}
Form here I would be able to aggregate the data and calculate the overall mean or mean by userID.
However the for loop takes forever with such a large dataset. Is there a faster/better method?