I have a reasonably large (3 million rows) data.table containing invoices and payments from many customers, where each customer has a number of (unique) documents that in turn each have a creation date, and a date where the document is payed. If there is no payment date (yet), the payment_date column lists NA. The data looks something like this:
dt = data.table(
customer_id = c(rep(1,4), rep(2,4)),
document_id = c(1:8),
creation_date = as.Date(c("2005-03-01", "2005-03-03", "2005-03-10", "2005-03-25", "2006-03-01", "2006-03-04", "2006-03-10", "2006-03-12"), "%Y-%m-%d"),
payment_date = as.Date(c("2005-03-05", "2005-03-07", NA, "2005-03-28", "2006-03-05", NA, "2006-03-15", "2006-03-16"), "%Y-%m-%d"),
open_docs_10 = c(0,1,2,1,0,1,2,3),
percentage_open_10 = c(0.0,0.20,0.70,1.0,0.0,0.3,1.0,1.0)
)
Per document (i.e. per row), I wish to (ideally) calculate two features:
1) Open_docs_10, which is the number of unpaid or "open" documents the customer_id of the current document had in a certain time window (say 10 days) before the creation date of the document_id. "Open" means on that the payment_date is NA, falls after or falls inside the time interval, while the creation_date is inside or before the time interval.
2) Percentage_open_10, which is the percentage number of days of the time windows the customer had open documents. The number of documents doesn't really matter; the figure says something like "4 out of the 10 previous days did this customer have open payments, when this new document was created".
For 1), I tried something like:
open_docs_10 = dt[,c("customer_id", "document_id", "creation_date", "payment_date")] %>%
.[, open_docs_10 := .[.(customer_id = customer_id, upper = creation_date, lower = creation_date - days(10)),
on = .(customer_id, payment_date >= lower, creation_date > lower), uniqueN(document_id), by=.EACHI
]$V1
]
But this doesn't give the correct result yet, because the true/correct join condition would be have to be something like
payment_date >= lower OR upper >= creation_date >= lower
It seems I can't use and/or statements inside the "on" clause. But how do I achieve that, using data.table?
For 2), I have no clue how to approach this problem.
I am not bound by using data.table in any real sense; maybe I'm trying to solve my problem in a difficult way when another R package would give a much smarter way of handling things? Any help would be greatly appreciated!