Delay calculated per group for different splits

Question

I want to calculate the average delay between transactions for different splits. I already have a solution, but I need the delay calculated through a different method.

The dataset looks like this:

customer_id      transaction_date       type      sign     period  
    A               01/01/15              A         C     30 days    
    A               05/01/15              A         C     30 days    
    A               10/01/15              B         D     30 days    
    A               25/01/15              B         D     30 days    

transaction_data = structure(list(customer_id = c("A", "A", "A", "A"), 
transaction_date = c("01/01/15", 
"05/01/15", "10/01/15", "25/01/15"), type = c("A", "A", "B", 
"B"), sign = c("C", "C", "D", "D"), period = c("30 days", "30 days", 
"30 days", "30 days")), .Names = c("customer_id", "transaction_date", 
"type", "sign", "period"), row.names = c(NA, -4L), class = "data.frame")

Solution old method

What I used to do was to calculate the delay between subsequent transactions first, like this:

# Delay between subseauent transactions
library(data.table)
setDT(transaction_data)[,delay_in_transactions_days:= c(0, diff.Date(transaction_date)), .(customer_id)]

# Convert seconds to days
transaction_data <- mutate(transaction_data, delay_in_days = delay_in_transactions_days/86400)
# Convert to integer
transaction_data$delay_in_days <- as.integer(transaction_data$delay_in_days)

And then through dcast calculate the mean of each split for each transaction delay:

dcast(setDT(transaction_data), customer_id ~ paste0("avg_delay_",period), value.var = "delay_in_days", mean)

Problem new method

The new method I want to use to calculate the delay is through the following equation:

For each client: (latest transaction - first transaction) / (number of transactions - 1)

The problem of course, is that the delay cannot be calculated per period, as then it would be delay of all transactions. Instead, it needs to be calculated as delay per period of a particular type or sign, or combination of the splits.

Any ideas how I might approach the problem?

Expected output

customer_id   av.delay_30days  av.delay_30_days_TYPE_A  av.delay_30_days_TYPE_B

     A               8                   4                         15

Perhaps `setDT(transaction_data)[ ,transaction_date := as.Date(transaction_date, "%d/%m/%y")][order(transaction_date), delay := (transaction_date[.N] - transaction_date[1])/(.N-1), customer_id][]` — akrun, Feb 14 '17 at 18:38
well, your code I believe already groups it by Customer_id. But how would I do it for type or sign? — Prometheus, Feb 14 '17 at 18:42
i.e. I get 4 and 15 if I include the 'type' and 'sign', but then the 8 days will be based on only customerid, as a separate group, `setDT(transaction_data)[ ,transaction_date := as.Date(transaction_date, "%d/%m/%y")][order(transaction_date), delay := (transaction_date[.N] - transaction_date[1])/(.N-1), .(customer_id, rleid(type, sign))]` — akrun, Feb 14 '17 at 18:43
so the question would be essentially: what is the delay per customer, within period 30, of transaction from type A? — Prometheus, Feb 14 '17 at 18:43
it is a bit troubling to navigate the site after some updates happened in the site. Didn't see your expected earlier. — akrun, Feb 14 '17 at 18:45
hey akrun, sorry I just checked your new comment with the code. I need just a sec to test it. As I cant think that fast. — Prometheus, Feb 14 '17 at 18:46

Uwe · Answer 1 · 2017-04-18T10:20:31.220

Please, try the following approach which uses dcast()and join from the data.table package.

The formula given by the OP

(latest transaction - first transaction) / (number of transactions - 1)

is implemented as diff(range(transaction_date)) / (length(transaction_date) - 1L).

library(data.table)
setDT(transaction_data)

# coerce transaction_date to class Date
transaction_data[, transaction_date := lubridate::dmy(transaction_date)]

# compute average delay for each customer according to OP's formula
avg_dly_total <- transaction_data[
  , .(av.delay_30days = diff(range(transaction_date), units = "days") / (.N - 1L)), 
  by = customer_id]

avg_dly_total
#   customer_id av.delay_30days
#1:           A          8 days

# compute average delay by Type for each customer
avg_dly_type <- transaction_data[
  , .(av.delay_30days = diff(range(transaction_date), units = "days") / (.N - 1L)), 
  by = .(customer_id, type)]

avg_dly_type
#   customer_id type av.delay_30days
#1:           A    A          4 days
#2:           A    B         15 days

# cast type results from long to wide
value_var <- "av.delay_30days"
temp <- dcast(avg_dly_type, customer_id ~ paste0(value_var, "_TYPE_", type), 
              value.var = value_var)

temp
#   customer_id av.delay_30days_TYPE_A av.delay_30days_TYPE_B
#1:           A                 4 days                15 days

# join with totals
result <- avg_dly_total[temp, on = "customer_id"]

The final result looks almost exactly like the expected output

 result
#   customer_id av.delay_30days av.delay_30days_TYPE_A av.delay_30days_TYPE_B
#1:           A          8 days                 4 days                15 days

Delay calculated per group for different splits

Solution old method

Problem new method

1 Answers1