0

I have a data set that I use for machine learning. However, I want to divide my data set into training set and test set. My training set should be made of all the loans issued until September. My test will include the rest (ie, Q4-Oct, Nov, Dec). How can I prepare the data set in the way that I consider most appropriate for this task?

    issue_d int_rate    installment dti revol_bal   revol_util  inq_last_6mths  delinq_2yrs pub_rec loan_status purpose_credit_card purpose_debt_consolidation  purpose_home_improvement    purpose_house   purpose_major_purchase  purpose_medical purpose_moving  purpose_other   purpose_renewable_energy    purpose_small_business  purpose_vacation    purpose_wedding
11  Mar-2018    14.07%  233.05  24.69   707 15.7%   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0
16  Mar-2018    11.98%  232.44  20.25   5004    36% 0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0
17  Mar-2018    26.77%  607.97  24.40   7364    46% 1   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
20  Mar-2018    20.39%  560.94  15.76   14591   34.2%   0   1   0   1   0   0   0   1   0   0   0   0   0   0   0   0
23  Mar-2018    7.34%   930.99  16.18   755 0%  0   1   0   1   0   0   0   1   0   0   0   0   0   0   0   0
...
130741  Apr-2018    6.07%   309.85  14.64   17380   24.5%   1   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0
130742  Apr-2018    11.98%  555.86  21.05   19591   20.5%   2   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0
130744  Apr-2018    11.98%  215.84  14.68   4707    37.7%   1   0   0   1   0   1   0   0   0   0   0   0   0   0   0   0

The type of issue_d is object.

So far I did not worry about dates using:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)

Annex : to reproduce the dataframe

You can download the CSV here (bank loans for 2018. They are divided into four quarters). Using Python 3 can be obtained by doing:

import pandas as pd 
# Control delimiters, rows, column names with read_csv (see later) 
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1,data_Q2,data_Q3,data_Q4]

result = pd.concat(frames)
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])]
Revolucion for Monica
  • 2,848
  • 8
  • 39
  • 78

2 Answers2

0

You need to convert your Date (Currently an Object) into a datetime.

Then proceed with the ML.

So your Pandas code will be something like

df['issue_d'] = df['issus_d'].astype('datetime64[ns]') 

If however your date times are in some weird/strange/non-standard format, then write a custom function that does

 strptime  (Extract the Custom time)

Then return the date time object.... this function can then be used via an Apply() like this

df['d_object'] = df.d_object.apply(my_convert_function)

Hope that assists

Tim Seed
  • 5,119
  • 2
  • 30
  • 26
0

The 'issue_d' column contains strings such as

['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018']

If we convert these to monthly periods:

In [545]: periods = pd.PeriodIndex(['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018'], freq='M'); periods
Out[545]: PeriodIndex(['2018-03', '2018-02', '2018-01', '2018-06', '2018-05', '2018-04'], dtype='period[M]', freq='M')

then we can create use an expression like periods <= '2018-09' (yes, PeriodIndexs understand comparison with a string) to create a boolean mask to select rows to go into the training and test DataFrames.

In [558]: pd.PeriodIndex(['Mar-2018', 'Feb-2018', 'Jan-2018', 'Jun-2018', 'May-2018', 'Apr-2018'], freq='M') < '2018-04'
Out[558]: array([ True,  True,  True, False, False, False])

import pandas as pd 
# Control delimiters, rows, column names with read_csv (see later) 
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q3.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1,data_Q2,data_Q3,data_Q4]
result = pd.concat(frames)
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])]

subset['issue_period'] = pd.PeriodIndex(subset['issue_d'].values, freq='M')
mask = (subset['issue_period'] <= '2018-09')
train = subset.loc[mask]
test = subset.loc[~mask]
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677