cleaning the data-frame with conditions

Question

I am trying to clean a dataframe by deleting the wrongly added rows.

This is the dummy data :

temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002", 
                               "02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
                               "08/07/2002", "15/07/2002", "17/07/2002", 
                               "22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"), 
                      payment = c(200, 1000,-1000, -1000, 1000,
                                  -1000,-1000,-1000, 1200,
                                  -1200, 1200, 1200,
                                  200, 56700, -56700, -200), 
                      Code = c("ABC", "M567", "M567","M567", "XYZ", "M567", "ABX" ,
                               "M567","M567", "M567", 
                               "M567", "M567", "M300", 
                               "M678", "M678", "ABC"), 
                      ID = c(NA, "98","187","187","12ee","M11","M13",
                             NA,"K999", 
                             "K999", "111", "111", "11",
                             "12345", NA, NA)), row.names = c(NA, -16L), class = "data.frame")

The dataframe looks this this

         Date payment Code    ID
1  24/06/2002     200  ABC  <NA>
2  24/06/2002    1000 M567    98
3  25/06/2002   -1000 M567   187
4  25/06/2002   -1000 M567   187
5  26/06/2002    1000  XYZ  12ee
6  02/07/2002   -1000 M567   M11
7  03/07/2002   -1000  ABX   M13
8  24/07/2002   -1000 M567  <NA>
9  08/07/2002    1200 M567  K999
10 08/07/2002   -1200 M567  K999
11 15/07/2002    1200 M567   111
12 17/07/2002    1200 M567   111
13 22/07/2002     200 M300    11
14 22/07/2002   56700 M678 12345
15 28/07/2002  -56700 M678  <NA>
16 29/07/2002    -200  ABC  <NA>

As you can see there are some positive and negative payments in the data. The negative payments are the wrongly added transactions or refunds.

For example +1200 will cancel out with -1200 based on Code and ID whereas row 14 and 15 are similar but the ID is NA - So I have to fill that with ID of its positive payment row and viseversa. So that I can delete those both rows.

Code I tried with the help of a programmer on StackOverflow (previously asked):

library(dplyr)
library(data.table)
library(tidyr)
Final_df <- df1 %>% 
  group_by(Code) %>%
  mutate(ind = rowid(payment)) %>%
  group_by(ind, .add = TRUE) %>% 
  fill(ID, .direction = 'downup') #%>% 
  ungroup %>%
  mutate(absPayment = abs(payment)) %>% 
  arrange(ID, Code, absPayment) %>%
  group_by(Code, ID, absPayment) %>%
  mutate(grp = rowid(sign(payment))) %>% 
  group_by(grp, .add = TRUE) %>%
  filter(n() == 1) %>% 
  ungroup %>%   
  select(names(df1))

But the problem here is row 8 - 24/07/2002 -1000 M567 should be filled by row 2 as the code and positive payment is matched - so that later I can cancel these both rows. Since the row is far away from row 8 .direction = 'downup' is not working.

And I think there is a better way to fill NA's other than using direction( As it is not getting applied with similar rows far away)

The Expected output is:

         Date payment Code    ID

1  25/06/2002   -1000 M567   187
2  25/06/2002   -1000 M567   187
3  26/06/2002    1000  XYZ  12ee
4  02/07/2002   -1000 M567   M11
5  03/07/2002   -1000  ABX   M13
6  15/07/2002    1200 M567   111
7  17/07/2002    1200 M567   111
8  22/07/2002     200 M300    11

I am struck at this since 5 days. Any solutions would be really helpful.

Thanks in advance

Another possible Dummy data:

temp_2 <-  structure(list(Date = c("22/06/2002", "23/06/2002","24/06/2002", "25/06/2002","25/06/2002", "26/06/2002", 
                               "02/07/2002","03/07/2002","24/07/2002", "08/07/2002",
                               "08/07/2002", "15/07/2002", "17/07/2002", 
                               "22/07/2002", "22/07/2002", "28/07/2002", "29/07/2002"), 
                      payment = c(200,-1000, 1000,-1000, -1000, 1000,
                                  -1000,-1000,-1000, 1200,
                                  -1200, 1200, 1200,
                                  200, 56700, -56700, -200), 
                      Code = c("ABC", "M567","M567", "M567","M567", "XYZ", "M567", "ABX" ,
                               "M567","M567", "M567", 
                               "M567", "M567", "M300", 
                               "M678", "M678", "ABC"), 
                      ID = c(NA,"187", "98","187","187","12ee",NA,NA,
                             NA,"K999", 
                             "K999", "111", "111", "11",
                             "12345", NA, NA)), row.names = c(NA, -17L), class = "data.frame")

Expected Output for temp_2:

         Date payment Code    ID

1  23/06/2002   -1000 M567   187
2  25/06/2002   -1000 M567   187
3  25/06/2002   -1000 M567   187
4  26/06/2002    1000  XYZ  12ee
5  03/07/2002   -1000  ABX  <NA>
6  24/07/2002   -1000 M567   98
7 15/07/2002    1200 M567   111
8 17/07/2002    1200 M567   111
9 22/07/2002     200 M300    11

akrun · Accepted Answer · 2022-10-17T17:47:35.267

2

We can use

library(dplyr)
library(data.table)
f1 <- function(dat) {
        i1 <- is.na(dat$ID) & nrow(dat) > 1
         if(any(i1)) {
              dat$ID[i1] <- dat$ID[!i1][match(dat$payment[i1], 
                -dat$payment[!i1])]
               }
            return(dat)
            }
            
temp %>%
   mutate(rn = row_number()) %>%
   group_by(Code, absPayment = abs(payment)) %>%
   filter(sum(payment) != 0) %>%
   group_modify(~ f1(.x)) %>%
   group_by(ID, .add = TRUE) %>%
   mutate(grp = rowid(sign(payment))) %>% 
   group_by(grp, .add = TRUE) %>%
   filter(n() == 1) %>% 
   ungroup %>%
    arrange(rn) %>%
   select(names(temp))

-output

# A tibble: 8 × 4
  Date       payment Code  ID   
  <chr>        <dbl> <chr> <chr>
1 25/06/2002   -1000 M567  187  
2 25/06/2002   -1000 M567  187  
3 26/06/2002    1000 XYZ   12ee 
4 02/07/2002   -1000 M567  M11  
5 03/07/2002   -1000 ABX   M13  
6 15/07/2002    1200 M567  111  
7 17/07/2002    1200 M567  111  
8 22/07/2002     200 M300  11

For the second case

 temp_2 %>%
   mutate(rn = row_number()) %>% 
   group_by(Code, absPayment = abs(payment)) %>%
   filter(sum(payment) != 0) %>%
   group_modify(~ f1(.x)) %>%
   group_by(ID, .add = TRUE) %>%
   mutate(grp = rowid(sign(payment))) %>% 
   group_by(grp, .add = TRUE) %>%
   filter(n() == 1) %>% 
   ungroup %>%
   arrange(rn) %>%
   select(names(temp_2))

-output

# A tibble: 9 × 4
  Date       payment Code  ID   
  <chr>        <dbl> <chr> <chr>
1 23/06/2002   -1000 M567  187  
2 25/06/2002   -1000 M567  187  
3 25/06/2002   -1000 M567  187  
4 26/06/2002    1000 XYZ   12ee 
5 03/07/2002   -1000 ABX   <NA> 
6 24/07/2002   -1000 M567  98   
7 15/07/2002    1200 M567  111  
8 17/07/2002    1200 M567  111  
9 22/07/2002     200 M300  11

edited Oct 17 '22 at 17:47

answered Oct 16 '22 at 16:44

akrun

874,273
37
540
662

1

@bella_pa I am not sure whether there are other cases as well. So far, this solution works for the 3 cases you showed (including the one in other question) – akrun Oct 16 '22 at 17:01
1

I will try this for the real data. I covered all the cases I hope this works – Bella_18 Oct 16 '22 at 17:06
for second case the expected output do not match. ``2 24/07/2002 -1000 M567 111 `` must be filled with its '98' but rather got filled with 111 which is the text of 1200 payment – Bella_18 Oct 17 '22 at 07:58
Can we do something like this? ``Step 1:`` delete rows that have same code, same ID and payment of +ve and -ve pair. (which cuts down the data) ``Step 2:`` Delete the rows with NA if they have a matching Code and sum(payment) = = 0. In this way, NA rows would be deleted with its pair – Bella_18 Oct 17 '22 at 13:22
1

@bella_pa Your Step 1: will not work because same ID implies all the NAs will be clubbed together i.e. if there are more than one NA for a single code and if its payment match it will be removed? – akrun Oct 17 '22 at 15:16
1

@bella_pa what is `assignment` in the group – akrun Oct 17 '22 at 15:45
Sorry thats from original data. Step 1: you already solved it - in the old question -- ``dummy%>% mutate(absPayment = abs(payment)) %>% arrange(Code, ID, absPayment) %>% group_by(Code, ID, absPayment) %>% mutate(grp = data.table::rowid(sign(amount))) %>% group_by(grp, .add = TRUE) %>% filter(n() == 1) %>% ungroup %>% select(-grp, -absPayment)`` – Bella_18 Oct 17 '22 at 15:46
1

@bella_pa if i use that code in your comments (removing the columns that are not present) `temp_2 %>% mutate(absPayment = abs(payment)) %>% arrange(Code, ID, absPayment) %>% group_by(absPayment) %>% mutate(grp = data.table::rowid(sign(payment))) %>% group_by(grp, .add = TRUE) %>% filter(n() == 1) %>% ungroup %>% select(-grp, -absPayment)` I get only 7 rows – akrun Oct 17 '22 at 15:47
No No. ``groupby(Code, ID, absPayment) `` when you run that with temp_2 you would get 13 rows. Where 4 rows would be deleted ``1200,-1200 and 200,-200`` – Bella_18 Oct 17 '22 at 15:53
It is okay if both 200 and -200 has NA for ID - its a pair and can be removed. The case where NA is only for one sign is to be solved in step 2 – Bella_18 Oct 17 '22 at 15:57
1

@bella_pa sure, let me test it – akrun Oct 17 '22 at 16:32
@bella_pa By checking your temp_2, i have a doubt `3 24/06/2002 1000 M567 98` 3rd row and `7 02/07/2002 -1000 M567 ` and 7th row, why can't these two rows match to fill 98 for 7th row instead of 9th row `9 24/07/2002 -1000 M567 `? Is it based on the date? – akrun Oct 17 '22 at 16:54
1

@bella_pa even if it is by date, 02/07/2002 comes before 24/07/2002 and thus the first one should get filled with the 98 ID where its date is 24/06/2002 – akrun Oct 17 '22 at 17:06
@bella_pa is that a typo in your input data? – akrun Oct 17 '22 at 17:10
yes we consider date wise usually. Since its NA we do not know which row is a refund. So any row can be cancelled – Bella_18 Oct 17 '22 at 17:16
But now, I am trying for a solution where only rows with NAs are considered and - instead of filling them with possible ID - Directly to cancel the row which has a sum(payment) == 0 and Code is matched. Irrespective of ID column – Bella_18 Oct 17 '22 at 17:19
1

@bella_pa so you will either have 02/07/2002 for 98 or 24/07/2002 for 98 as valid row – akrun Oct 17 '22 at 17:19
Its okay if filling 98 is not possible. because there are too many conditions and too manys cases to be looked into which is somehow filling NA with wrong values. Thats the reason for Step 2 we can ignore the ID column – Bella_18 Oct 17 '22 at 17:22
1

@bella_pa it is possible to fill 98, but my issue is which row to select i.e. either 02/07/2002 or 24/07/2002 can be filled with 98, here the logic is not clear – akrun Oct 17 '22 at 17:27
02/07/2002 is better for preference – Bella_18 Oct 17 '22 at 17:32
1

@bella_pa try the update. Now you get the rows as in expected. Works for both datasets – akrun Oct 17 '22 at 17:40
Is it just for Step 2? or both combined? – Bella_18 Oct 17 '22 at 17:43
@bella_pa I created a function `f1` which can be used for replacing the NA in `ID` and then do the grouping with rowid as showed in the updatd solution – akrun Oct 17 '22 at 17:44
1

@bella_pa just for easier comparision, I added a row number at the start and then arranged at the end by rn. Now, you can compare more easily – akrun Oct 17 '22 at 17:48
1

Yes. Thank you very much NA's are getting filled as expected. and the output also matches – Bella_18 Oct 17 '22 at 17:50

score 1 · Answer 2 · answered Oct 16 '22 at 11:07

1

Here is my attempt at solving it, the trick is to replace the NAs correctly.

# fill NAs according to their values 
temp <- temp %>% 
  mutate(abs_payment = abs(payment)) %>% 
  group_by(abs_payment, ID, Code) %>% 
  # should consider replacement only if ID has only one row or if it is NA
  mutate(is_candidate = (n() == 1) | is.na(ID)) %>%
  group_by(abs_payment, Code) %>% 
  # we do not want to replace IDs for non-na IDs 
  mutate(new_ID = case_when(is_candidate & is.na(ID) ~ na.omit(ID)[1],
                            TRUE ~ ID))


# remove if sum equal to 0 
temp <- temp %>% 
  group_by(Code, new_ID, abs_payment) %>% 
  mutate(total = sum(payment)) %>% 
  filter(total != 0 )

answered Oct 16 '22 at 11:07

Kozolovska

1,090
6
14

Thank you very much @Kozolovska . Can you explain the step ``mutate(new_ID = case_when(is_candidate & is.na(ID) ~ na.omit(ID)[1], TRUE ~ ID))``. How is NA exactly getting replaced by its opposite payment. I am a bit confused – Bella_18 Oct 16 '22 at 12:39
case_when is a switch statement basically it means, if is_candidate is True and the ID is NA then replace it with the first ID which is not NA (na.omit(ID)[1]). If that statement is not TRUE then simply replace it with the ID. This step is in order to replace the NAs, later I sum the payments according to the absolute value, new ID column and the Code and I remove all groups where the sum of payments is 0. – Kozolovska Oct 16 '22 at 12:55
But this code is not working for temp_2 (please see the end of the question- i added dummy data with few extra columns). NA is getting replaced with wrong value. Can you please help me with this – Bella_18 Oct 16 '22 at 13:13

score 0 · Answer 3 · answered Oct 16 '22 at 11:43

I'm not sure if this is correct, but here is my attempt. I don't see exactly how you are deriving the expected output. Are their other filtering criteria? Your original data has more than 8 rows.

library(tidyverse)

temp |>
  mutate(Date = lubridate::dmy(Date)) |>
  arrange(Code, abs(payment)) |>
  group_by(Code, abs(payment), ID) |>
  mutate(n = n()) |>
  ungroup()|>
  group_by(Code, abs(payment), n) |>
  fill(ID, .direction = "updown") |>
  ungroup()|>
  select(names(temp)) |>
  arrange(Date, abs(payment))
#> # A tibble: 16 x 4
#>    Date       payment Code  ID   
#>    <date>       <dbl> <chr> <chr>
#>  1 2002-06-24     200 ABC   <NA> 
#>  2 2002-06-24    1000 M567  98   
#>  3 2002-06-25   -1000 M567  187  
#>  4 2002-06-25   -1000 M567  187  
#>  5 2002-06-26    1000 XYZ   12ee 
#>  6 2002-07-02   -1000 M567  M11  
#>  7 2002-07-03   -1000 ABX   M13  
#>  8 2002-07-08    1200 M567  K999 
#>  9 2002-07-08   -1200 M567  K999 
#> 10 2002-07-15    1200 M567  111  
#> 11 2002-07-17    1200 M567  111  
#> 12 2002-07-22     200 M300  11   
#> 13 2002-07-22   56700 M678  12345
#> 14 2002-07-24   -1000 M567  M11  
#> 15 2002-07-28  -56700 M678  12345
#> 16 2002-07-29    -200 ABC   <NA>

Normally a positive payment row would cancel out with a negative row that has same ID and Code. But My data has few NA's in ID while refund! So In that case - I am supposed to check the positive row that has same payment and Code to delete the rows — Bella_18, Oct 16 '22 at 12:09

cleaning the data-frame with conditions

3 Answers3

Linked