I have an imbalanced dataset with about 200 million samples from class 0 and 8,000 samples from class 1. I followed two different approaches to build a model.
- Randomly sample a new dataset with a 1:4 class ratio, i.e. 32,000 samples from class 0 and 8,000 from class 1. Then use featuretools to generate features (70 features generated in my case), split the dataset into train and test sets with test_size=0.2, stratified on the target, build a model with the Random Forest algorithm, and predict the test set. (A sketch of the sampling step follows after the code below.)
Code:
import pandas as pd
import featuretools as ft
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(...)
label = df['target']

# Build the entityset and generate features
es = ft.EntitySet(id='maintable')
es = es.entity_from_dataframe(entity_id='maintable', dataframe=df, make_index=True,
                              index='index', time_index='date_info',
                              variable_types={'personal_id': ft.variable_types.Categorical,
                                              'category_id': ft.variable_types.Categorical,
                                              'name': ft.variable_types.Categorical})
es.normalize_entity(base_entity_id='maintable', new_entity_id='personal_id', index='personal_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='category_id', index='category_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='name', index='name')
fm, features = ft.dfs(entityset=es, target_entity='maintable', max_depth=3)

# Re-attach the label to the feature matrix
fm = fm.set_index(label.index)
fm['target'] = label

# Stratified 80/20 split, then scale and fit a Random Forest
X = fm[fm.columns.difference(['target'])]
y = fm['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# print results
.....
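For reference, the 1:4 sample itself was drawn before this script ran. Roughly, it can be done like this (a simplified sketch; full_df and the file paths are placeholders, and loading the full data into memory is assumed to be possible here):
import pandas as pd

full_df = pd.read_csv(...)  # complete data with the binary 'target' column (placeholder path)
minority = full_df[full_df['target'] == 1]  # all 8,000 class-1 rows
majority = full_df[full_df['target'] == 0].sample(n=4 * len(minority), random_state=42)  # 32,000 random class-0 rows
sampled = pd.concat([minority, majority]).sample(frac=1, random_state=42)  # shuffle the rows
sampled.to_csv(..., index=False)  # this becomes the CSV read above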
- Use all the data from class 1, putting 60% into the train set and 40% into the test set. The class ratio in the train set is the same as in the first approach (1:4), but in the test set it is 1:200. Use featuretools again (70 features created), build a model with the Random Forest algorithm, and predict the test set.
Code:
import pandas as pd
import featuretools as ft
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# I merged the randomly generated (with Java) train and test sets so featuretools creates
# features on both at once. A binary column 'test_data' (1 for test rows, 0 for train rows)
# lets me separate them again before fitting the model and predicting.
df = pd.read_csv(...)
label = df[['target', 'test_data']]

es = ft.EntitySet(id='maintable')
es = es.entity_from_dataframe(entity_id='maintable', dataframe=df, make_index=True,
                              index='index', time_index='date_info',
                              variable_types={'personal_id': ft.variable_types.Categorical,
                                              'category_id': ft.variable_types.Categorical,
                                              'name': ft.variable_types.Categorical})
es.normalize_entity(base_entity_id='maintable', new_entity_id='personal_id', index='personal_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='category_id', index='category_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='name', index='name')
fm, features = ft.dfs(entityset=es, target_entity='maintable', max_depth=3)

# Re-attach the label and the train/test flag, then split on the flag
fm = fm.set_index(label.index)
fm[['target', 'test_data']] = label
df_train = fm.loc[fm['test_data'] == 0]
df_test = fm.loc[fm['test_data'] == 1]

# Drop 'test_data' because it is not needed anymore
df_train = df_train.drop(['test_data'], axis=1)
df_test = df_test.drop(['test_data'], axis=1)
X_train = df_train[df_train.columns.difference(['target'])]
y_train = df_train['target']
X_test = df_test[df_test.columns.difference(['target'])]
y_test = df_test['target']

# Scale and fit a Random Forest
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# print results
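The TN/FP/TP/FN numbers below come from the confusion matrix; something like this prints them (with class 0 as negative and class 1 as positive):
from sklearn.metrics import classification_report, confusion_matrix

# For binary labels {0, 1}, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('TN:', tn, 'FP:', fp, 'TP:', tp, 'FN:', fn)
print(classification_report(y_test, y_pred, digits=3))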
Now the interesting part begins for me. Here are the results of the two approaches.
Approach 1 (class 0 is negative, class 1 is positive):
TN: 6306
FP: 94
TP: 1385
FN: 215
Approach 2:
TN: 576743
FP: 63257
TP: 361
FN: 2839
The first result is pretty good for me, but the second one is terrible. How is this possible? I know I am using less class 1 data to train the model in the second approach, but it should not differ that much; it is worse than a coin flip. The subsets are randomly generated in both approaches, and I tried many different subsets, but the results are pretty much the same as above. Any kind of help is appreciated.
Edit: I may have an idea, but I am not sure... I am using train_test_split in the first approach, so the train and test sets share some personal_id's, while in the second approach the train and test sets have completely different personal_id's. When the model encounters a personal_id it didn't see before, it cannot predict correctly and falls back to the majority class. If this is the case, then the features are effectively being created for the specific values of the categorical variables (overfitting). Again, when the model encounters a new value in any categorical column, it just gets confused. How can I overcome such an issue?
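One way I can think of to test this on the first approach is a group-aware split, so that no personal_id appears in both train and test. A sketch with scikit-learn's GroupShuffleSplit (it assumes the rows of the feature matrix are still aligned with df, so df['personal_id'] can serve as the group label):
from sklearn.model_selection import GroupShuffleSplit

# Each personal_id ends up entirely in the train set or entirely in the test set
groups = df['personal_id']
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]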
Edit 2: I tested the idea mentioned above and got weird results. First I removed the personal_id column from the dataset, but that actually gave a better model. Then I changed my second approach so that the personal_id's that appear in the train set also appear in the test set. I thought I would get a better model, but it was worse than before. I am really confused...
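To dig further into which generated features the model actually relies on (and whether the personal_id-based ones dominate), I am also looking at the feature importances of the fitted forest from the first approach, roughly like this:
import pandas as pd

# StandardScaler keeps the column order, so the importances line up with X.columns
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(20))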