3

I have a dataset that is based on a number of transactions (16M) for the full dataset. In this dataset there are a mix of positive and negative transactions. The challenge I have is that the negative amounts don't always pair as an inverse amount of the positive amounts.

My goal is to have the negative amounts cancel out any positive amounts. There are some cases where the refund is done over multiple transactions to zero out the positive transaction, and not all transactions get zeroed out.

Below is an example data set that demonstrates the point

library(data.table)

dt <- data.table(
  id = c(
    rep("ID1", 5), rep("ID2", 6), rep("ID3", 4)),
  date = as.Date(c(
    "2022-01-01", rep("2021-07-01", 3), "2023-01-01",
    "2022-01-15", "2022-03-10", "2022-04-01", "2022-04-02", "2021-07-01", "2021-07-01",
    "2022-01-15", "2022-02-15", "2022-03-15", "2022-04-15")),
  amount = c(
    100, 500, -350, -150, 20,
    22, 250, -200, -50, 75, -25,
    rep(48,4))
)

setkey(dt, id, date, amount)
# Key: <id, date, amount>
#         id       date amount
#     <char>     <Date>  <num>
#  1:    ID1 2021-07-01   -350  \
#  2:    ID1 2021-07-01   -150  - These gifts are all on the same date and the negative amounts cancel the positive
#  3:    ID1 2021-07-01    500  /
#  4:    ID1 2022-01-01    100
#  5:    ID1 2023-01-01     20
#  6:    ID2 2021-07-01    -25  \
#  7:    ID2 2021-07-01     75  - These 2 transactions are on the same day but leave a net value of 50
#  8:    ID2 2022-01-15     22
#  9:    ID2 2022-03-10    250  \
# 10:    ID2 2022-04-01   -200  - These transactions end up cancelling each other out, but are on different dates
# 11:    ID2 2022-04-02    -50  /
# 12:    ID3 2022-01-15     48
# 13:    ID3 2022-02-15     48
# 14:    ID3 2022-03-15     48
# 15:    ID3 2022-04-15     48

I have created a function that does the work, but it makes use of a recursive lapply which I think is causing things to be quite inefficient at the scale of 16M transactions.

apply_refunds_fast <- function(dt) {
  donorbase <- copy(dt)
  refunds <- donorbase[, let(idx = .I)] |>
    _[amount < 0, .(idx, date, amount, id)]
  lapply(
    seq_len(nrow(refunds)),
    function(x, refunds, donorbase) {
      single_refund <- refunds[x]
      tx_to_reverse <- donorbase[
        id == single_refund$id
        & date <= single_refund$date
        & amount >= (single_refund$amount * -1),
        idx[.N]]
      donorbase[
        tx_to_reverse,
        let(amount = amount + single_refund$amount)]
      if (length(tx_to_reverse)) {
        donorbase[
          single_refund$idx,
          let(amount = 0,
              refund_status = "Refund Applied")]
      }
    },
    refunds, donorbase
  )
  round(donorbase$amount, 3)
}

dt[, amount := apply_refunds_fast(.SD)]
dt

# # expected outcome
#         id       date amount
#     <char>     <Date>  <num>
#  1:    ID1 2021-07-01      0
#  2:    ID1 2021-07-01      0
#  3:    ID1 2021-07-01      0
#  4:    ID1 2022-01-01    100
#  5:    ID1 2023-01-01     20
#  6:    ID2 2021-07-01      0
#  7:    ID2 2021-07-01     50
#  8:    ID2 2022-01-15     22
#  9:    ID2 2022-03-10      0
# 10:    ID2 2022-04-01      0
# 11:    ID2 2022-04-02      0
# 12:    ID3 2022-01-15     48
# 13:    ID3 2022-02-15     48
# 14:    ID3 2022-03-15     48
# 15:    ID3 2022-04-15     48

I'd appreciate any expertise in how to make this more efficient at scale. If anyone would like a sample of data with more records to work with please let me know and I'll try to make something available.

I have also seen these two questions but they're attempting to address slightly different outcomes.

Dan
  • 2,625
  • 5
  • 27
  • 42
  • Isn't it simply a sum by group of `id` + `date`? – bretauv Aug 02 '23 at 09:42
  • 1
    Or maybe I misunderstood your problem, can you explain the process for a few rows? It's a bit hard to grasp with this big chunk of code – bretauv Aug 02 '23 at 09:46
  • Unfortunately it isn't as simple as sum and group by `id` + `date`. Rows 9-11 are the same `id` but different dates and reflect the a refund type transaction over a few days. – Dan Aug 02 '23 at 09:56
  • 1
    Where does `let()` come from? – NicChr Aug 02 '23 at 09:59
  • 3
    @NicChr `let()` is an [alias for the functional form](https://rdatatable.gitlab.io/data.table/reference/assign.html) of `:=`. – SamR Aug 02 '23 at 10:03
  • 1
    I have edited the post to provide some clarifying information as requested by @bretauv – Dan Aug 02 '23 at 10:06

1 Answers1

2

Sort each id by decreasing date, then the zeroing out can be vectorized by group, with the groups starting at a new id or a series of consecutive returns:

setkey(
  setorder(dt, id, -date, amount)[
    ,amount := pmin(pmax(amount, 0), pmax(cumsum(amount), 0)),
    cumsum(id != shift(id, 1, "") | sign(shift(amount, 1, 0)) - sign(amount) == 2)
  ],
  id, date, amount
)[]
#>      id       date amount
#>  1: ID1 2021-07-01      0
#>  2: ID1 2021-07-01      0
#>  3: ID1 2021-07-01      0
#>  4: ID1 2022-01-01    100
#>  5: ID1 2023-01-01     20
#>  6: ID2 2021-07-01      0
#>  7: ID2 2021-07-01     50
#>  8: ID2 2022-01-15     22
#>  9: ID2 2022-03-10      0
#> 10: ID2 2022-04-01      0
#> 11: ID2 2022-04-02      0
#> 12: ID3 2022-01-15     48
#> 13: ID3 2022-02-15     48
#> 14: ID3 2022-03-15     48
#> 15: ID3 2022-04-15     48
jblood94
  • 10,340
  • 1
  • 10
  • 15