How to remove rows after a particular observation is seen for the first time

Question

I have a dataset wherein I have account number and "days past due" with every observation. For every account number, as soon as the "days past due" column hits a code like "DLQ3", I want to remove rest of the rows for that account (even if DLQ3 is the first observation for that account).

My dataset looks like:

Obs_month  Acc_No       OS_Bal      Days_past_due
201005     2000000031   3572.68     NORM
201006     2000000031   4036.78     NORM
200810     2000000049   39741.97    NORM
200811     2000000049   38437.54    DLQ3
200812     2000000049   23923.98    DLQ1
200901     2000000049   35063.88    NORM

So, for account 2000000049, I want to remove all the rows post the date 200812 as now it's in default.

So in all, I want to see when the account hits DLQ3 and when it does I want to remove all the rows post the first DLQ3 observation.

What I tried was to subset the data with all DLQ3 observations and order the observation month in ascending order and getting an unique list of account number which have DLQ3 and their first month of hitting DLQ3. Post that I thought I could do some left_join with the orginal data and use ifelse but the flow is dicey.

Can you indent your dataset. Hard to understand its structure. (highlight data, press command K on Mac) — Cybernetic, Jan 12 '18 at 05:18

score 0 · Answer 1 · answered Jan 12 '18 at 06:26

The following function will scan your data frame and find the row containing the DLQ3 tag. It will then remove all rows for that account number that occur after that tag.

scan_table <- function(data_frame, due_column, acct_column, due_tag) {
    for(i in 1:nrow(data_frame)) {
        if(data_frame[i,c(due_column)] == due_tag) {
            # remove rows past here, for this account 
            acct_num <- data_frame[i,c(acct_column)]
            top_frame <- data_frame[1:i,] # cut point
            sub_frame <- subset(data_frame, Acc_No != acct_num)
            final_frame <- unique(do.call('rbind', list(top_frame, sub_frame)))
            return(final_frame)
        }
    }
}

Example:

df

Usage:

scan_table(df, 'Days_past_due', 'Acc_No', 'DLQ3')

Let me know if you wanted something different.

In the subframe line why is there a change in the name for account number? Have you assumed them to be written differently? — Nupur Jain, Jan 12 '18 at 11:27
The subframe line removes the offending account number from the dataset. Not sure what you mean by changing names. — Cybernetic, Jan 12 '18 at 16:20

Patrik_P · Answer 2 · 2018-01-12T12:04:18.970

0

Given your example

data <- read.table(text=
"Obs_month  Acc_No       OS_Bal      Days_past_due
201005     2000000031   3572.68     NORM
201006     2000000031   4036.78     NORM
200810     2000000049   39741.97    NORM
200811     2000000049   38437.54    DLQ3
200812     2000000049   23923.98    DLQ1
200901     2000000049   35063.88    NORM", stringsAsFactors=F, header=T)

I will sort it

data <- data[with(data, order(Acc_No, Obs_month)), ]

and define a function that allows you to set the code indicating expiry ("DLQ3" or "DLQ1" from your example)

sbst <- function(data, pattern){
  if( all(data$Days_past_due %in% "NORM") == TRUE){
    return(data)} else{
      indx <- min(grep(1, match(data$Days_past_due, pattern, nomatch = 0)))
      data <- data[1:indx,]
      return(data)
    }
}

Finally, apply the function and aggregate the lists of data.frame into final data.frame

Reduce(rbind, lapply(split(data, data$Acc_No), sbst, patter="DLQ3"))
#  Obs_month     Acc_No   OS_Bal Days_past_due
#1    201005 2000000031  3572.68          NORM
#2    201006 2000000031  4036.78          NORM
#3    200810 2000000049 39741.97          NORM
#4    200811 2000000049 38437.54          DLQ3

edited Jan 12 '18 at 12:04

answered Jan 12 '18 at 06:41

Patrik_P

3,066
3
22
39

This way throws an error called : Error in match(data$Days_past_due, pattern, nomatch = 0) : argument "pattern" is missing, with no default Called from: grep(1, match(data$Days_past_due, pattern, nomatch = 0)) – Nupur Jain Jan 12 '18 at 11:46
pattern declaration was missing `patter="DLQ3"`, added it, thnx for the notice – Patrik_P Jan 12 '18 at 12:05
If I understand your coding right there but be an error due to my part. There are 4 categories in the days past due that is, NORM, DLQ1, DLQ2, DLQ3. And the code is taking care of only NORM? – Nupur Jain Jan 12 '18 at 12:30
I supposed that NORM mean OK, indicating non-expiry, hence only the other values like DLQ1, DLQ2, DLQ3. could be indication of expiry. Is my assuption correct? – Patrik_P Jan 12 '18 at 12:32
The point wherein the account number hits DLQ3 for the first time that means he is in default. So an account can have all the 4 observations but we need to remove the for which it hits DLQ3 for the very first time – Nupur Jain Jan 12 '18 at 12:43

How to remove rows after a particular observation is seen for the first time

2 Answers2

Linked