
I have a reasonably large (3 million rows) data.table containing invoices and payments from many customers. Each customer has a number of (unique) documents, each of which has a creation date and a date on which the document is paid. If there is no payment date (yet), the payment_date column contains NA. The data looks something like this:

library(data.table)

dt = data.table(
  customer_id = c(rep(1,4), rep(2,4)),
  document_id = c(1:8),
  creation_date = as.Date(c("2005-03-01", "2005-03-03", "2005-03-10", "2005-03-25", "2006-03-01", "2006-03-04", "2006-03-10", "2006-03-12"), "%Y-%m-%d"),
  payment_date = as.Date(c("2005-03-05", "2005-03-07", NA, "2005-03-28", "2006-03-05", NA, "2006-03-15", "2006-03-16"), "%Y-%m-%d"),
  # the last two columns are the desired results, included here for reference
  open_docs_10 = c(0,1,2,1,0,1,2,3),
  percentage_open_10 = c(0.0,0.20,0.70,1.0,0.0,0.3,1.0,1.0)
)

Per document (i.e. per row), I wish to (ideally) calculate two features:

1) open_docs_10, the number of unpaid or "open" documents the customer_id of the current document had in a certain time window (say 10 days) before the creation date of the document_id. "Open" means that the payment_date is NA or falls inside or after the time window, while the creation_date falls inside or before the window.

2) percentage_open_10, the percentage of days in the time window on which the customer had open documents. The number of documents doesn't really matter; the figure says something like "on 4 out of the 10 days preceding this document's creation, this customer had open payments". See the sketch below.
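To make the definitions concrete, here is a brute-force per-row sketch of roughly what I am after (I am not being careful here about whether the window endpoints are inclusive); it is of course far too slow for 3 million rows, but it illustrates the semantics:

ndays <- 10
n_open <- integer(nrow(dt))
pct_open <- numeric(nrow(dt))
for (i in seq_len(nrow(dt))) {
  upper <- dt$creation_date[i]
  lower <- upper - ndays + 1                 # 10-day window ending on creation_date
  other <- dt$customer_id == dt$customer_id[i] & dt$document_id != dt$document_id[i]
  # a document counts as "open" in the window if it was created on or before
  # the window end and is unpaid (NA) or was paid on or after the window start
  open <- other & dt$creation_date <= upper &
    (is.na(dt$payment_date) | dt$payment_date >= lower)
  n_open[i] <- sum(open)
  # fraction of window days on which at least one other document was open
  days <- seq(lower, upper, by = "day")
  covered <- vapply(days, function(d) {
    any(open & dt$creation_date <= d &
          (is.na(dt$payment_date) | dt$payment_date >= d))
  }, logical(1))
  pct_open[i] <- mean(covered)
}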

For 1), I tried something like:


library(lubridate)  # for days()
library(magrittr)   # for %>%

open_docs_10 = dt[, c("customer_id", "document_id", "creation_date", "payment_date")] %>% 
  .[, open_docs_10 := .[.(customer_id = customer_id, upper = creation_date, lower = creation_date - days(10)),
                        on = .(customer_id, payment_date >= lower, creation_date > lower),
                        uniqueN(document_id), by = .EACHI]$V1]

But this doesn't give the correct result yet, because the true/correct join condition would have to be something like

payment_date >= lower OR upper >= creation_date >= lower

It seems I can't use and/or statements inside the `on` clause. But how do I achieve that, using data.table?

For 2), beyond the brute-force loop above, I have no clue how to approach this problem efficiently.

I am not bound to data.table in any real sense; maybe I'm trying to solve my problem the hard way, and another R package would offer a much smarter way of handling things? Any help would be greatly appreciated!


1 Answer


I think you are not consistently including or excluding the end dates in your calculation of percentage_open_10. For example, with inclusive endpoints, document 2's window runs from 2005-02-22 to 2005-03-03, and document 1 is open from 2005-03-01 to 2005-03-05, overlapping on 3 days, i.e. 0.3 rather than the 0.20 in your sample data. If we include end dates, you can use the following:

ndays <- 10L

# replace NA payment dates with a far-future sentinel so that "unpaid" and
# "paid on or after the window start" collapse into the single non-equi
# join condition payment_date >= cd10 (no OR needed in the on= clause)
setnafill(dt, fill=as.IDate("9999-12-31"), cols="payment_date")

# start of the inclusive 10-day window ending on creation_date
dt[, cd10 := creation_date - ndays + 1L]

dt[, c("open_docs_10", "percentage_open_10") := 
    # non-equi self join: for each document i, find the same customer's
    # documents created on or before i's creation date and paid (or
    # sentinel-unpaid) on or after the window start
    .SD[.SD, on=.(customer_id, creation_date<=creation_date, payment_date>=cd10), 
        allow.cartesian=TRUE, by=.EACHI, {
        ix <- x.document_id != i.document_id    # exclude the document itself
        p <- 0
        if (any(ix)) {
            # treat the open days as one contiguous span within the window,
            # from the earliest open creation (capped below at cd10) to the
            # latest payment (capped above at i's creation date), inclusive
            lastd <- min(c(i.creation_date, max(x.payment_date[ix])))
            firstd <- max(c(i.cd10, min(x.creation_date[ix])))
            p <- as.numeric(lastd - firstd + 1) / ndays
        }
        .(.N - 1L, p)    # .N - 1L: number of matching documents other than i
    }][, (1L:3L) := NULL]    # drop the three join columns; keep the two results
]

output:

   customer_id document_id creation_date payment_date       cd10 open_docs_10 percentage_open_10
1:           1           1    2005-03-01   2005-03-05 2005-02-20            0                0.0
2:           1           2    2005-03-03   2005-03-07 2005-02-22            1                0.3
3:           1           3    2005-03-10   9999-12-31 2005-03-01            2                0.7
4:           1           4    2005-03-25   2005-03-28 2005-03-16            1                1.0
5:           2           5    2006-03-01   2006-03-05 2006-02-20            0                0.0
6:           2           6    2006-03-04   9999-12-31 2006-02-23            1                0.4
7:           2           7    2006-03-10   2006-03-15 2006-03-01            2                1.0
8:           2           8    2006-03-12   2006-03-16 2006-03-03            3                1.0
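Note that setnafill has overwritten the NA payment dates in place with the 9999-12-31 sentinel (as visible in the output above), and the helper column cd10 is still there. If you want the original NAs back once the features are computed, something along these lines should do it:

# restore the NAs that the sentinel fill overwrote, and drop the helper column
dt[payment_date == as.Date("9999-12-31"), payment_date := NA]
dt[, cd10 := NULL]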

However, with 3 million rows, I am not hopeful that this can be completed in a few seconds.

  • Many many thanks for your help! You are correct that I was inconsistent with having the end/start dates included/excluded in the interval. I haven't run the code on the full dataset yet, but when I do, I'll let you know how long it took! – ImpactGuide Mar 20 '20 at 21:45
  • Just as a small update, to run the code on the full dataset it took about 6-7 minutes on an i7-7700k. It's not super fast (in a real time sense), but considering the amount of data being queried a very satisfactory result! – ImpactGuide Mar 23 '20 at 15:42
  • Is your document id in running order? Maybe can save a min or two – chinsoon12 Mar 23 '20 at 21:40
  • I have just added this suggestion to the script, and that really speeds things up. It's now down to around 4 minutes! I use the script to create several variables, each one looking back over an increasingly longer period of time, so this addition ends up saving around 10 minutes in total, which is great! Many thanks! – ImpactGuide Mar 25 '20 at 13:45