I have a data set that I use for machine learning, and I want to divide it into a training set and a test set by issue date. The training set should contain all loans issued through September. The test set should contain the rest (i.e., Q4: Oct, Nov, Dec). How can I prepare the data set for this kind of split?
issue_d int_rate installment dti revol_bal revol_util inq_last_6mths delinq_2yrs pub_rec loan_status purpose_credit_card purpose_debt_consolidation purpose_home_improvement purpose_house purpose_major_purchase purpose_medical purpose_moving purpose_other purpose_renewable_energy purpose_small_business purpose_vacation purpose_wedding
11 Mar-2018 14.07% 233.05 24.69 707 15.7% 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
16 Mar-2018 11.98% 232.44 20.25 5004 36% 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
17 Mar-2018 26.77% 607.97 24.40 7364 46% 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
20 Mar-2018 20.39% 560.94 15.76 14591 34.2% 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0
23 Mar-2018 7.34% 930.99 16.18 755 0% 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0
...
130741 Apr-2018 6.07% 309.85 14.64 17380 24.5% 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
130742 Apr-2018 11.98% 555.86 21.05 19591 20.5% 2 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
130744 Apr-2018 11.98% 215.84 14.68 4707 37.7% 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
The dtype of issue_d is object.
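Because issue_d is stored as strings such as "Mar-2018", it probably needs to be converted to a datetime before any date comparison. A minimal sketch, assuming every value follows the "Mon-YYYY" pattern shown above with English month abbreviations (the small frame `df` is a stand-in, not the real data):

```python
import pandas as pd

# Stand-in frame using the same "Mon-YYYY" strings as the issue_d column
df = pd.DataFrame({"issue_d": ["Mar-2018", "Apr-2018", "Oct-2018", "Dec-2018"]})

# %b = abbreviated month name, %Y = four-digit year;
# each value is parsed to the first day of that month
df["issue_d"] = pd.to_datetime(df["issue_d"], format="%b-%Y")

print(df["issue_d"].dtype)
print(df["issue_d"].dt.month.tolist())
```

After the conversion, the column can be compared against timestamps or inspected through the `.dt` accessor (for example `.dt.month` or `.dt.quarter`).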
So far I have ignored the dates and split the data randomly with:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=123, stratify=y)
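For comparison, a date-based split could look roughly like the sketch below. It is only an illustration: the helper name `split_by_issue_date`, the cutoff of 2018-10-01, and the toy rows are assumptions, and it assumes loan_status is the target. Dropping issue_d from the features is a design choice, not a requirement.

```python
import pandas as pd

def split_by_issue_date(df, cutoff="2018-10-01",
                        date_col="issue_d", target_col="loan_status"):
    """Hypothetical helper: rows issued before `cutoff` go to train, the rest to test."""
    # Parse "Mon-YYYY" strings; each value becomes the first day of that month
    dates = pd.to_datetime(df[date_col], format="%b-%Y")
    train_mask = dates < pd.Timestamp(cutoff)

    # Features exclude the target and the raw date column
    X = df.drop(columns=[target_col, date_col])
    y = df[target_col]
    return X[train_mask], X[~train_mask], y[train_mask], y[~train_mask]

# Toy rows for illustration only (not taken from the real files)
toy = pd.DataFrame({
    "issue_d": ["Mar-2018", "Sep-2018", "Oct-2018", "Dec-2018"],
    "dti": [24.69, 20.25, 24.40, 15.76],
    "loan_status": [1, 1, 0, 1],
})

X_train, X_test, y_train, y_test = split_by_issue_date(toy)
print(len(X_train), len(X_test))
```

Unlike the random split, this one does not shuffle or stratify, so the class balance in the Q4 test set may differ from the training set.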
Annex: to reproduce the dataframe
You can download the CSV files here (bank loans for 2018, divided into four quarters). With Python 3, the dataframe can be built as follows:
import pandas as pd
# Control delimiters, rows, column names with read_csv (see later)
data_Q1 = pd.read_csv("LoanStats_2018Q1.csv", skiprows=1, skipfooter=2, engine='python')
data_Q2 = pd.read_csv("LoanStats_2018Q2.csv", skiprows=1, skipfooter=2, engine='python')
data_Q3 = pd.read_csv("LoanStats_2018Q3.csv", skiprows=1, skipfooter=2, engine='python')
data_Q4 = pd.read_csv("LoanStats_2018Q4.csv", skiprows=1, skipfooter=2, engine='python')
frames = [data_Q1, data_Q2, data_Q3, data_Q4]
result = pd.concat(frames)
subset = result.loc[result["loan_status"].isin(['Charged Off','Fully Paid'])]