I have a dataset of roughly 16M transactions containing a mix of positive and negative amounts. The challenge is that the negative amounts don't always pair one-to-one as the inverse of a positive amount.
My goal is to have the negative amounts cancel out the positive amounts. In some cases a refund is spread over multiple transactions to zero out a single positive transaction, and not all positive transactions get zeroed out.
Below is an example dataset that demonstrates the point:
library(data.table)
dt <- data.table(
  id = c(rep("ID1", 5), rep("ID2", 6), rep("ID3", 4)),
  date = as.Date(c(
    "2022-01-01", rep("2021-07-01", 3), "2023-01-01",
    "2022-01-15", "2022-03-10", "2022-04-01", "2022-04-02", "2021-07-01", "2021-07-01",
    "2022-01-15", "2022-02-15", "2022-03-15", "2022-04-15")),
  amount = c(
    100, 500, -350, -150, 20,
    22, 250, -200, -50, 75, -25,
    rep(48, 4))
)
setkey(dt, id, date, amount)
# Key: <id, date, amount>
# id date amount
# <char> <Date> <num>
# 1: ID1 2021-07-01 -350 \
# 2: ID1 2021-07-01 -150 - These gifts are all on the same date and the negative amounts cancel the positive
# 3: ID1 2021-07-01 500 /
# 4: ID1 2022-01-01 100
# 5: ID1 2023-01-01 20
# 6: ID2 2021-07-01 -25 \
# 7: ID2 2021-07-01 75 - These 2 transactions are on the same day but leave a net value of 50
# 8: ID2 2022-01-15 22
# 9: ID2 2022-03-10 250 \
# 10: ID2 2022-04-01 -200 - These transactions end up cancelling each other out, but are on different dates
# 11: ID2 2022-04-02 -50 /
# 12: ID3 2022-01-15 48
# 13: ID3 2022-02-15 48
# 14: ID3 2022-03-15 48
# 15: ID3 2022-04-15 48
I have created a function that does the work, but it loops over every refund row with `lapply`, updating the table by reference as a side effect, which I think is what makes it quite inefficient at the scale of 16M transactions. (Note: `let()` is the data.table alias for `:=` and requires a recent data.table version; the `_` pipe placeholder requires a recent R version.)
apply_refunds_fast <- function(dt) {
  donorbase <- copy(dt)
  # Row index of every refund (negative amount), kept with its id and date
  refunds <- donorbase[, let(idx = .I)] |>
    _[amount < 0, .(idx, date, amount, id)]
  lapply(
    seq_len(nrow(refunds)),
    function(x, refunds, donorbase) {
      single_refund <- refunds[x]
      # Most recent transaction for the same id, on or before the refund
      # date, that is large enough to absorb the refund
      tx_to_reverse <- donorbase[
        id == single_refund$id
        & date <= single_refund$date
        & amount >= -single_refund$amount,
        idx[.N]]
      # Subtract the refund from the matched transaction (by reference)
      donorbase[
        tx_to_reverse,
        let(amount = amount + single_refund$amount)]
      # If a match was found, zero out the refund row itself
      if (length(tx_to_reverse)) {
        donorbase[
          single_refund$idx,
          let(amount = 0,
              refund_status = "Refund Applied")]
      }
    },
    refunds, donorbase
  )
  round(donorbase$amount, 3)
}
dt[, amount := apply_refunds_fast(.SD)]
dt
# expected outcome
# id date amount
# <char> <Date> <num>
# 1: ID1 2021-07-01 0
# 2: ID1 2021-07-01 0
# 3: ID1 2021-07-01 0
# 4: ID1 2022-01-01 100
# 5: ID1 2023-01-01 20
# 6: ID2 2021-07-01 0
# 7: ID2 2021-07-01 50
# 8: ID2 2022-01-15 22
# 9: ID2 2022-03-10 0
# 10: ID2 2022-04-01 0
# 11: ID2 2022-04-02 0
# 12: ID3 2022-01-15 48
# 13: ID3 2022-02-15 48
# 14: ID3 2022-03-15 48
# 15: ID3 2022-04-15 48
I'd appreciate any expertise on how to make this more efficient at scale. If anyone would like a sample of data with more records to work with, please let me know and I'll try to make something available.
I have also seen these two questions, but they're attempting to address slightly different outcomes.
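In the meantime, for anyone who wants to reproduce the scaling behaviour without waiting for a shared file, here is a rough sketch of how a larger synthetic dataset could be generated. The sizes, date range, amount range, and the 20% refund proportion are arbitrary assumptions for illustration, not properties of my real data:

```r
library(data.table)
set.seed(42)

n_ids       <- 1e5   # assumed number of ids; scale toward 16M total rows as needed
txns_per_id <- 10    # assumed average transactions per id
n           <- n_ids * txns_per_id

synthetic <- data.table(
  id     = rep(sprintf("ID%06d", seq_len(n_ids)), each = txns_per_id),
  date   = as.Date("2021-01-01") + sample(0:730, n, replace = TRUE),
  amount = round(runif(n, 10, 500), 2)
)

# Flip roughly 20% of rows to negative amounts to stand in for refunds
neg <- sample(n, n * 0.2)
synthetic[neg, amount := -amount]
setkey(synthetic, id, date, amount)

# Time the function above on the synthetic data
system.time(synthetic[, amount := apply_refunds_fast(.SD)])
```

The refunds here won't necessarily find a matching positive transaction the way real refunds do, but for timing purposes the per-refund filtering cost should be representative.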