clean up the data frame based on conditions

Question

I have a dataframe looking like this.

Dummy Data:

df1 <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002", 
                               "02/07/2002","24/07/2002", "08/07/2002",
                               "08/07/2002", "15/07/2002", "17/07/2002", 
                               "22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"), 
                      payment = c(200,   1000, 1000, 
                                  -1000,-1000, 1200,
                                  -1200, 1200, 1200,
                                  200, 56700, -56700, -200), 
                      Code = c("ABC", "M567", "M567", "M567", 
                               "M567","M567", "M567", 
                                "M567", "M567", "M300", 
                               "M678", "M678", "ABC"), 
                      ID = c(NA, "98","98","M11",
                             NA,"K999", 
                             "K999", "111", "111", "11",
                             "12345", NA, NA)), row.names = c(NA, -13L), class = "data.frame")

Some of the rows are just return of payment:

         Date payment Code    ID
1  24/06/2002     200  ABC  <NA>
2  24/06/2002    1000 M567    98
3  25/06/2002    1000 M567    98
4  02/07/2002   -1000 M567   M11
5  24/07/2002   -1000 M567  <NA>
6  08/07/2002    1200 M567  K999
7  08/07/2002   -1200 M567  K999
8  15/07/2002    1200 M567   111
9  17/07/2002    1200 M567   111
10 22/07/2002     200 M300    11
11 22/07/2002   56700 M678 12345
12 28/07/2002  -56700 M678  <NA>
13 29/07/2002    -200  ABC  <NA>

I want to delete the rows that has a same +ve and -ve payment by taking Code and ID into consideration. i.e.,

if the code and ID column matches and the payment is cancelling out then delete those two columns
if ID is present for +ve payment and absent for -ve payment viseversa then only take Code for grouping
If ID is absent for both +ve and -ve then only take Code column for grouping

So, the resultant dataframe should be like this:

  Date       payment Code  ID   
  <chr>        <dbl> <chr> <chr>
1 22/07/2002     200 M300  11   
2 15/07/2002    1200 M567  111  
3 17/07/2002    1200 M567  111  
4 25/06/2002    1000 M567  98   
5 02/07/2002   -1000 M567  M11

Can anyone help me with this?

akrun · Accepted Answer · 2022-10-16T16:38:15.873

3

We group by 'Code' column, then create a sequence column for each unique element of 'payment', then add the sequence also with 'Code' to fill the missing element in 'ID' by the previous non-NA id', ungroup, remove the rows where 'ID' is still NA (drop_na), create a column of payment with absolute values, order by 'Code', 'ID' and absolute payment column, then, grouped by the same columns, create a grouping column by taking the sequence of sign values of 'payment', add as grouping column, and filter only groups having a single row

library(dplyr)
library(data.table)
library(tidyr)
library(lubridate)
df1 %>% 
  group_by(Code) %>%
  filter(sum(payment) != 0) %>%
  arrange(Code, abs(payment), !is.na(ID)) %>%
  mutate(ind = rowid(payment)) %>%
  group_by(ind, .add = TRUE) %>% 
  fill(ID, .direction = 'downup') %>% 
  ungroup %>%
  drop_na(ID) %>% 
  mutate(absPayment = abs(payment)) %>% 
  arrange(ID, Code, absPayment) %>%
  group_by(Code, ID, absPayment) %>%
  mutate(grp = rowid(sign(payment))) %>% 
  group_by(grp, .add = TRUE) %>%
  filter(n() == 1) %>% 
  ungroup %>%   
  select(names(df1))

-output

# A tibble: 5 × 4
  Date       payment Code  ID   
  <chr>        <dbl> <chr> <chr>
1 22/07/2002     200 M300  11   
2 15/07/2002    1200 M567  111  
3 17/07/2002    1200 M567  111  
4 25/06/2002    1000 M567  98   
5 02/07/2002   -1000 M567  M11

data

df1 <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002", 
"02/07/2002", "24/07/2002", "08/07/2002", "08/07/2002", "15/07/2002", 
"17/07/2002", "22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"
), payment = c(200, 1000, 1000, -1000, -1000, 1200, -1200, 1200, 
1200, 200, 56700, -56700, -200), Code = c("ABC", "M567", "M567", 
"M567", "M567", "M567", "M567", "M567", "M567", "M300", "M678", 
"M678", "ABC"), ID = c(NA, "98", "98", "M11", NA, "K999", "K999", 
"111", "111", "11", "12345", NA, NA)), row.names = c(NA, -13L
), class = "data.frame")

edited Oct 16 '22 at 16:38

answered Oct 11 '22 at 15:38

akrun

874,273
37
540
662

Thank you @akrun. Can you please add comments if possible – Bella_18 Oct 12 '22 at 07:30
Sorry to ask again. I just realized that there are some rows where ID is missing for the negative row. Can you please tell me how can I implement an "or statement" at group by? – Bella_18 Oct 12 '22 at 07:45
Thank you very much. @akrun I just added an extra line of data . So even if ID is NA it has to check Code column and cancel out the +ve and -ve signs. ( Also updated the data) – Bella_18 Oct 13 '22 at 11:06
So that three cases are covered - If code and Id both are present / ID is present for +ve and absent for negative / ID is absent for +ve and -ve – Bella_18 Oct 13 '22 at 11:16
I have updated with all the possible cases in the question with same datatypes with original data. Code and ID is not always same for all the payment rows. So, filling NAs with the use of direction gives an ambiguous answer. Sorry for re-editing I feel this a bit complicated - since I am very new to R programming – Bella_18 Oct 14 '22 at 13:05
Please once run the code you provided with the updated data - u would understand my question @akrun. I ran it just now and since direction is "downup" NA will be replaced by its nearest value based on ID rather than the opposite sign's ID – Bella_18 Oct 14 '22 at 17:10
I have altered the row- 4 and 5. due to which NA's replacement wouldn't be replaces as expected – Bella_18 Oct 14 '22 at 17:13
1

I am really sorry about it. I really appreciate your time and patience. This is the last update. I am sorry again - I could not figure out a way to fill NA's with the opposite sign Code and Payment. If date is used in arrange - NAs will not be filled by checking the sign of payment – Bella_18 Oct 14 '22 at 17:27
I have been manually typing the expected inputs and outputs. I have just pasted the data from the console – Bella_18 Oct 14 '22 at 18:05
So, the row --> `24/07/2002 -1000 M567 ` is the return/refund of the payment --> which should be cancelled out with the original payment i.e `24/06/2002 1000 M567 98` or `25/06/2002 1000 M567 98` - hence leave the remaining two rows uncancelled – Bella_18 Oct 14 '22 at 18:15
Thanks a lot. I really appreciate your help. I have been struck at this since 4 days @akrun – Bella_18 Oct 15 '22 at 18:04
Can you please tell me how ``fill(ID, .direction = 'downup') %>% `` is filling NA with the opposite sign? because in the group by or in the mutate statements you did not mention anything related to sign of the payment – Bella_18 Oct 15 '22 at 20:45
But this is working for a small dataset. If the row ``25/06/2002 1000 M567 98 `` is far away(6-7 rows above) from its Negative payment the NA is not getting filled – Bella_18 Oct 16 '22 at 10:12
The only problem is to identify the opposite sign of payment and fill NA with its ID of the opposite sign. Which is difficult to write in a code. @akrun – Bella_18 Oct 16 '22 at 15:51
https://stackoverflow.com/questions/74086430/delete-wrongly-added-rows-with-help-of-few-columns-in-r/74086577?noredirect=1#comment130808332_74086577 - please check my newly posted question – Bella_18 Oct 16 '22 at 16:05
Can we write the code in a way where - NA is filled if sum(payment) is zero and "Code" is same for both rows? - This would solve the issue I think – Bella_18 Oct 16 '22 at 16:21
1

@bella_pa adding that additional condition, gives the same output for this example you showed – akrun Oct 16 '22 at 16:25
Thanks a lot for you time @akrun. But Even this way``. direction= "downup"`` wouldn't work if the rows are far way. It sure does work for the dummy data – Bella_18 Oct 16 '22 at 16:35

Waldi · Answer 2 · 2022-10-14T08:16:56.993

1

With data.table:

first fill missing IDs by Code / amount
group all payments (+/-) which have the same amount, ID, Code : this gives a data.table subset for each group noted .SD
take the sum of signs of +/- payments in each .SD
- if the sum is positive (for example +/-/+/+=2), filter only the positive values in .SD and keep the last sum(sign) positive values (here 2) using tail
- if the sum is negative (for example +/-/-=-1), filter only the negative values in .SD and keep the last sum(sign) negative values (here 1) using tail
if the sum is 0, keep 0 rows : the rows disappear as expected.

library(data.table)
setDT(df1)

df1[,ID:=nafill(ID,type = 'locf'),by=.(Code,abs(payment))][
    ,tail(.SD[sign(sum(sign(payment)))==sign(payment)],abs(sum(sign(payment))))
    ,by=.(ID, Code,abs(payment))]

#>       ID   Code   abs       Date
#>    <int> <char> <num>     <char>
#> 1:    98   M567  1000 01/07/2002
#> 2:   111   M567  1200 15/07/2002
#> 3:   111   M567  1200 17/07/2002
#> 4:    11   M300   200 22/07/2002

edited Oct 14 '22 at 08:16

answered Oct 13 '22 at 18:34

Waldi

39,242
6
30
78

Thank you @Waldi. Can you please ad comments to the code. It is difficult to understand what you did – Bella_18 Oct 14 '22 at 07:22
1

@bella_pa, see my edit with more explainations – Waldi Oct 14 '22 at 07:39
Thank you again. How would Code- 12345 cancel out with NA ? – Bella_18 Oct 14 '22 at 07:42
1

The `NA`s are filled in the first step described above – Waldi Oct 14 '22 at 07:43
Is it possible to fill NA's only taking into consideration of the opposite sign ? Instead of using abs(payment). – Bella_18 Oct 14 '22 at 11:18
1

abs(payment) is meant to fill the opposite sign : `-1000/Code;1000/NA` and `1000/Code; -1000/NA` should work. If this isn't what you expect, please share an example. – Waldi Oct 14 '22 at 11:27
na.fill is also not working for string type. Can you please help me out with this – Bella_18 Oct 14 '22 at 11:59
1

For string ID, you can use `zoo::na.locf(ID)` instead of `nafill(ID,type = 'locf')` – Waldi Oct 14 '22 at 13:24
Can you please update your answer w.r.t to updated data – Bella_18 Oct 14 '22 at 13:33
1

I'll do this, but could you first update your dummy data `df1 <- structure(...)` which doesn't seem to correspond to the subsequent modifications you made to your question, for instance `56700, -56900` seems wrong. – Waldi Oct 14 '22 at 13:36
Please check the updated data. Sorry for all the re-editing I feel this code a bit complicated ( I am very new to R programming) – Bella_18 Oct 14 '22 at 13:41

clean up the data frame based on conditions

2 Answers2

data

Linked