creating new columns based on several conditions in R

Question

I have a data frame consisting of three columns and the unique values for status are as follows "X" "0" "C" "1" "2" "3" "4" "5". In the beginning, I do not know how to group by each id and create several columns according to the conditions, for instance, a target column that is 1 if the status is 2, 3, 4, 5, and else is zero.

month_balance represents (The month of the extracted data is the starting point, backwards, 0 is the current month, -1 is the previous month, and so on)

status represents (0: 1-29 days past due, 1: 30-59 days past due, 2: 60-89 days overdue, 3: 90-119 days overdue, 4: 120-149 days overdue, 5: Overdue or bad debts write-offs for more than 150 days C: paid off that month, X: No loan for the month)

df <- data.frame (id  = c("5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805"),
                  month_balance = c("0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14","-15","0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14"),
                  status = c("C","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X")
                  )

In the end, I want to reach output as below:

df1 <- data.frame (id  = c("5008804","5008805"),
                  month_begin = c("16","15"),
                  paid_off = c("13","12"),
                  num_of_pastdues = c("2","2"),
                  no_loan = c("1","1"),
                  target = c("0","0"))

sorry I just started learning, I download data from https://www.kaggle.com/datasets/rikdifos/credit-card-approval-prediction — tara, Jun 29 '22 at 08:39

score 0 · Answer 1 · answered Jun 29 '22 at 08:39

Not quite sure how to code for target, as the status for each id appeared a both target 0 and 1 appeared multiple times.

Here's how I constructed for the other variables:

df %>% 
    group_by(id) %>% 
    summarise(
        month_begin=max(abs(as.numeric(month_balance)))+1, 
        paid_off=sum(status=="C"), 
        num_of_pastdues=sum(status %in% 0:5), 
        no_loan=sum(status=="X"))

# A tibble: 2 x 5
  id      month_begin paid_off num_of_pastdues no_loan
  <chr>         <dbl>    <int>           <int>   <int>
1 5008804          16       13               2       1
2 5008805          15       12               2       1

In order to determine the customer who fits the definition of bad, I choose users who are overdue for more than 60 days as target risk users. — tara, Jun 29 '22 at 08:43

score 0 · Answer 2 · answered Jun 29 '22 at 08:42

library(tidyverse)

df <- data.frame (id  = c("5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805"),
                  month_balance = c("0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14","-15","0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14"),
                  status = c("C","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X")
) %>% 
  as_tibble()

df %>%  
  mutate(target = case_when(status %in% c(2, 3, 4, 5) ~ 1, 
                            TRUE ~ 0), 
         paid_off = case_when(status == "C" ~ 1, 
                              TRUE ~ 0), 
         no_loan = case_when(status == "X" ~ 1,
                             TRUE ~ 0)) %>%  
  
  group_by(id) %>%  
  summarise(month_begin = n(), 
            across(c(paid_off, no_loan, target), sum))
#> # A tibble: 2 x 5
#>   id      month_begin paid_off no_loan target
#>   <chr>         <int>    <dbl>   <dbl>  <dbl>
#> 1 5008804          16       13       1      0
#> 2 5008805          15       12       1      0

^{Created on 2022-06-29 by the reprex package (v2.0.1)}

score 0 · Answer 3 · answered Jun 29 '22 at 08:46

0

You can try using dplyr. First you can create the variables with the conditions you want, then you can use summarize to count the times your condition was met per group.

df <- df %>%
  mutate(num_of_pastdues = case_when(
    status %in% c(2,3,4,5) ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(no_loan  = case_when(
    status == "X" ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(paid_off  = case_when(
    status == "C" ~ 1,
    TRUE ~ 0
  )) %>%
  group_by(id) %>% 
  summarise(num_of_pastdues = sum(num_of_pastdues), no_loan = sum(no_loan), paid_off = sum(paid_off))

answered Jun 29 '22 at 08:46

Djoustaine

65
7

1

You don't have to make multiple mutate calls. You can them all in one call, just add a comma after each crreated column. Like : `df |> mutate(col = col+2, col2 = col + 4, ...)` – Jun 29 '22 at 09:04
Thanks, I didn't know that was possible! – Djoustaine Jun 29 '22 at 09:12

Sotos · Accepted Answer · 2022-06-29T10:07:07.257

0

A base R solution can be to create a custom function and apply it on each group, i.e.

MyFunction <- function(x){
  month_begin = length(x)
  paid_off = sum(x == 'C')
  num_of_pastdues = sum(x %in% 0:5)
  no_loan = sum(x == 'X')
  target = ifelse(any(x %in% 2:5), 1, 0)
  return(c(month_begin=month_begin, paid_off=paid_off, num_of_pastdues=num_of_pastdues, no_loan=no_loan, target=target))
}

res <- t(sapply(split(df$status, df$id), MyFunction))

             month_begin paid_off num_of_pastdues no_loan target
#    5008804 16          13       2               1       0     
#    5008805 15          12       2               1       0

To make it a data frame with the column id then,

res_df <- data.frame(res)
res_df$id <- rownames(res_df)
rownames(res_df) <- NULL

res_df

#month_begin paid_off num_of_pastdues no_loan target      id
#1          16       13               2       1      0 5008804
#2          15       12               2       1      0 5008805

edited Jun 29 '22 at 10:07

answered Jun 29 '22 at 09:13

Sotos

51,121
6
32
66

thanks, however, I encounter another problem :( class() function gives me "matrix" "array" and then when I try to convert the matrix into a data frame, values seem like – tara Jun 29 '22 at 09:40
or how can we do the same process by not deleting the id column, I can't merge it with another dataset. – tara Jun 29 '22 at 09:59
I edited my answer. Note that I also changed the output of `MyFunction` – Sotos Jun 29 '22 at 10:07
res_df %>% count(target) give me only 0 – tara Jun 29 '22 at 10:18
1

sorry, my fault. thank you very much, I was dealing with it for a long time. – tara Jun 29 '22 at 10:24

creating new columns based on several conditions in R

4 Answers4