0

I'm working on a big data set of corporate account data in order to solve a classification problem if a firm goes bankrupt or not.

The dataset contains a variable liquid which states the year when the liquidation started. This variable is omnipresent in every year of observation given that the firm actually starts liquidation. Otherwise it is zero. Usually, liquid is larger than the last year of observation. So, there are no observations of the corporate data in the year the firm starts liquidation. Sometimes, there are even longer gaps. For example, a firm starts liquidation in 2005 but the last observation of the financial ratios is in 2002.

A sample of the data might look like this:

sample table

Now, I want to create a new dummy called bankruptcy. This should take the value of 1, if it is the last observation (with financial data) of a company that starts liquidation. You can see how bankruptcy should look like in the table above. How do I proceed?

J...S
  • 5,079
  • 1
  • 20
  • 35
Patrick
  • 1
  • 1

2 Answers2

0

there is probably a better way but how about

library(dplyr)

df <-structure(list(year = structure(c(1L, 2L, 3L, 2L, 3L, 4L, 5L,  2L, 3L), .Label = c("2000", "2001", "2002", "2003", "2004"), class = "factor"), liquid = structure(c(2L, 2L, 2L, NA, NA, NA, NA, 1L, 1L), .Label = c("2003",  "2005"), class = "factor"), company = structure(c(1L, 1L, 
1L, 2L, 2L, 2L, 2L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),  bankruptcy = c(0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("year", "liquid", "company", "bankruptcy"), row.names = c(NA, -9L), class = "data.frame")



df %>%
   mutate(bankruptcy = 0)  %>%
   group_by(company) %>%
   mutate(bankruptcy = c(bankruptcy[-n()], 1)) %>%
   mutate(bankruptcy = ifelse(is.na(liquid),0,bankruptcy))
CER
  • 854
  • 10
  • 22
  • I get somehow not the results I was aiming for, but the solution of seeellayewhy worked. Thank you anyway :) – Patrick Nov 01 '17 at 16:19
  • Can you specify what is wrong? Output looks like what you where asking for – CER Nov 01 '17 at 16:22
0

If I understand you correctly from your desired output, you want bankruptcy to take on a 1 in the highest value of liquid for each company.

h/t to @user6617454 for the structure.

df <-structure(list(year = structure(c(1L, 2L, 3L, 2L, 3L, 4L, 5L,  2L, 3L), .Label = c("2000", "2001", "2002", "2003", "2004"), class = "factor"), liquid = structure(c(2L, 2L, 2L, NA, NA, NA, NA, 1L, 1L), .Label = c("2003",  "2005"), class = "factor"), company = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),  ), .Names = c("year", "liquid", "company"), row.names = c(NA, -9L), class = "data.frame")

df$year <- as.numeric(as.character(df$year))

df$maxyear <- tapply(df$year, df$company, max)
df$bankruptcy <- ifelse(!is.na(df$liquid) & df$year == df$maxyear, 
                        1, 
                        0)

In that solution, bankruptcy will take on a 1 when there was a liquid value for the company and the particular row is the max for that company. If your exmaple isn't representative of your actual problem this might not work but this did produce the output in your attached image.

cparmstrong
  • 799
  • 6
  • 23
  • Thank you for your response, it basically worked! I used it for a large dataframe, so instead of tapply, I used the aggregate function and then merged the dataframe to my overall dataframe to create the bankruptcy variable. – Patrick Nov 01 '17 at 16:18
  • I began with that approach but went with tapply because it was a bit more elgeant. @Patrick could you accept my answer if it's the one you used? – cparmstrong Nov 01 '17 at 17:08