I have an imbalanced dataset with 200 million rows from class 0 and 8,000 rows from class 1. I tried two different approaches to build a model.

  1. Randomly sample a new dataset with a class ratio of 1:4, meaning 32,000 rows from class 0 and 8,000 from class 1. Then use featuretools to generate features (70 in my case) and split the dataset into train and test sets with test_size = 0.2, stratified on the target. Build a model with the Random Forest algorithm and predict the test set.

Code:

import ....
df = pd.read_csv(...)
label = df['target']
es = ft.EntitySet(id='maintable')

es = es.entity_from_dataframe(entity_id='maintable',dataframe=df,make_index=True,
index='index',time_index='date_info',variable_types={'personal_id': ft.variable_types.Categorical,
'category_id': ft.variable_types.Categorical, 'name': ft.variable_types.Categorical})

es.normalize_entity(base_entity_id='maintable',new_entity_id='personal_id')
es.normalize_entity(base_entity_id='maintable',new_entity_id='category_id')
es.normalize_entity(base_entity_id='maintable',new_entity_id='name')

fm, features = ft.dfs(entityset=es,target_entity='maintable',max_depth=3)

fm = fm.set_index(label.index)
fm['target'] = label

X = fm[fm.columns.difference(['target'])]
y = fm['target']

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42,stratify=y,test_size=0.2)

rf = RandomForestClassifier(random_state=42,n_jobs=-1)
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)

#print results
.....
  2. Split the class 1 data, using 60% for the train set and 40% for the test set. The class ratio of the train set is the same as in the first approach (1:4), but for the test set it is 1:200. Use featuretools (70 features created again), build a model with the Random Forest algorithm, and predict the test set.

Code:

import ....
df = pd.read_csv(...)
# I merged the randomly generated (with Java) train and test sets so that featuretools creates features for both.
# The binary column 'test_data' (1 = test set, 0 = train set) lets me separate them again for fitting and predicting.
label = df['target','test_data']
es = ft.EntitySet(id='maintable')

es = es.entity_from_dataframe(entity_id='maintable',dataframe=df,make_index=True,
index='index',time_index='date_info',variable_types={'personal_id': ft.variable_types.Categorical,
'category_id': ft.variable_types.Categorical, 'name': ft.variable_types.Categorical})

es.normalize_entity(base_entity_id='maintable',new_entity_id='personal_id')
es.normalize_entity(base_entity_id='maintable',new_entity_id='category_id')
es.normalize_entity(base_entity_id='maintable',new_entity_id='name')

fm, features = ft.dfs(entityset=es,target_entity='maintable',max_depth=3)

fm = fm.set_index(label.index)
fm['target','test_data'] = label

df_train = fm.loc[fm['test_data'] == 0]
df_test = fm.loc[fm['test_data'] == 1]

# Drop the 'test_data' column because it is no longer needed
df_train = df_train.drop(['test_data'],axis=1)
df_test = df_test.drop(['test_data'],axis=1)

X_train = df_train[df_train.columns.difference(['target'])]
y_train = df_train['target']

X_test = df_test[df_test.columns.difference(['target'])]
y_test = df_test['target']

rf = RandomForestClassifier(random_state=42,n_jobs=-1)
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)

#print results

Now the interesting part begins. Here are the results of the two approaches.

Approach 1 (class 0 is negative, class 1 is positive):

TN: 6306, FP: 94, TP: 1385, FN: 215

Approach 2:

TN: 576743, FP: 63257, TP: 361, FN: 2839

The first result is pretty good for me, but the second one is terrible. How is this possible? I know I am using less class 1 data to train the model in the second approach, but it should not differ that much. I mean, it is worse than a coin flip. The subsets are randomly generated in both approaches, and I tried many different subsets, but the results are pretty much the same as above. Any kind of help is appreciated.
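For reference, the counts above can be turned into rates. This is only a small sketch: the helper name `rates` is hypothetical, and the last step assumes that the recall and false-positive rate from approach 1 would carry over unchanged to a 1:200 test set, which nothing in the results confirms.

```python
def rates(tn, fp, tp, fn):
    """Return (precision, recall, false-positive rate, prevalence of class 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    prevalence = (tp + fn) / (tn + fp + tp + fn)
    return precision, recall, fpr, prevalence


approach_1 = rates(tn=6306, fp=94, tp=1385, fn=215)
approach_2 = rates(tn=576743, fp=63257, tp=361, fn=2839)

for name, r in (("approach 1", approach_1), ("approach 2", approach_2)):
    print(f"{name}: precision={r[0]:.4f} recall={r[1]:.4f} "
          f"fpr={r[2]:.4f} prevalence={r[3]:.4f}")

# Assumption: approach 1's recall and FPR stay the same on a 1:200 test set.
# Precision then follows from Bayes' rule with prevalence pi = 1 / 201.
_, tpr_1, fpr_1, _ = approach_1
pi = 1 / 201
precision_1_at_1_to_200 = tpr_1 * pi / (tpr_1 * pi + fpr_1 * (1 - pi))
print(f"approach 1 rates at 1:200: precision={precision_1_at_1_to_200:.4f}")
```

In approach 2 the recall (about 0.11) is close to the false-positive rate (about 0.10), and the precision (about 0.006) is barely above the class 1 prevalence (about 0.005), so the predictions are almost unrelated to the true label. A change in class ratio alone would not explain that: with approach 1's rates, precision at 1:200 would drop to roughly 0.23, but recall would stay near 0.87.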

Edit: I may have an idea, but I am not sure. I am using train_test_split in the first approach, so the train and test sets share some personal_ids, but in the second approach the train and test sets have completely different personal_ids. When the model encounters a personal_id that it has not seen before, it cannot predict correctly and labels the row as the majority class. If this is the case, then the features are being created specifically for the given categorical values (overfitting). Likewise, when the model encounters a new value in any categorical column, it gets confused. How can I overcome such an issue?

Edit 2: I tested the idea mentioned above and got weird results. First I removed the personal_id column from the dataset, but I ended up with a better model. Then I tested my second approach in such a way that the personal_ids that appear in the train set also appear in the test set. I thought I would get a better model, but it was worse than before. I am really confused.
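One way to check the hypothesis from the first edit without generating subsets by hand is a group-aware split, which keeps every personal_id entirely in the train set or entirely in the test set. The sketch below uses scikit-learn's `GroupShuffleSplit`. The toy data and the `feature` column are made up for illustration and stand in for the real feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the real table: 1000 rows spread over 50 personal_ids.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "personal_id": rng.integers(0, 50, size=1000),
    "feature": rng.normal(size=1000),
    "target": rng.integers(0, 2, size=1000),
})

# test_size applies to the groups: about 20% of the personal_ids go to the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(df, df["target"], groups=df["personal_id"])
)
train, test = df.iloc[train_idx], df.iloc[test_idx]

# No personal_id should appear on both sides of the split.
shared_ids = set(train["personal_id"]) & set(test["personal_id"])
print(f"personal_ids shared between train and test: {len(shared_ids)}")
```

If scores under this split are much lower than with a plain `train_test_split`, the model is probably leaning on identity-specific features. If they are similar, the cause is likely elsewhere.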

mcsahin

1 Answer

I agree that the model possibly overfitted and failed to generalize to new personal IDs. I suggest passing the labels in with the cutoff times to get a more structured training and testing set. I'll go through a quick example using this data.

    index     name  personal_id category_id   date_info  target
0       0   Samuel            3           C  2021-07-15       0
1       1   Samuel            3           C  2021-07-15       0
2       2   Samuel            3           C  2021-07-15       0
3       3   Samuel            3           C  2021-07-15       0
4       4  Rosanne            2           C  2021-05-11       0
..    ...      ...          ...         ...         ...     ...
95     95    Donia            1           C  2020-09-27       1
96     96    Donia            1           C  2020-09-27       1
97     97  Fleming            1           A  2021-06-15       1
98     98     Fred            1           C  2021-02-28       0
99     99  Giacomo            1           A  2021-06-19       1

[100 rows x 6 columns]

First, create cutoff times based on the time index that also include the target column. Make sure to drop the target column from the original data.

target = df[['date_info', 'index', 'target']]
df.drop(columns='target', inplace=True)

Then, you can structure the entity set as usual.

import featuretools as ft

es = ft.EntitySet(id='maintable')
es = es.entity_from_dataframe(
    entity_id='maintable',
    dataframe=df,
    index='index',
    time_index='date_info',
    variable_types={
        'personal_id': ft.variable_types.Categorical,
        'category_id': ft.variable_types.Categorical,
        'name': ft.variable_types.Categorical
    },
)
es.normalize_entity(base_entity_id='maintable', new_entity_id='personal_id', index='personal_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='category_id', index='category_id')
es.normalize_entity(base_entity_id='maintable', new_entity_id='name', index='name')

Now, in the DFS call, you can pass in the target cutoff times. This approach will not use the target column to build features and ensures that the target column will remain aligned with the feature matrix.

fm, fd = ft.dfs(entityset=es, target_entity='maintable', max_depth=3, cutoff_time=target)
       personal_id category_id     name  DAY(date_info)  ...  name.NUM_UNIQUE(maintable.MONTH(date_info))  name.NUM_UNIQUE(maintable.WEEKDAY(date_info))  name.NUM_UNIQUE(maintable.YEAR(date_info))  target
index                                                    ...
59               1           C     Fred              28  ...                                            1                                              1                                           1       0
35               1           A  Giacomo              19  ...                                            1                                              1                                           1       1
82               3           B  Laverna              17  ...                                            1                                              1                                           1       0
25               2           C  Rosanne              11  ...                                            1                                              1                                           1       0
23               1           A  Giacomo              19  ...                                            1                                              1                                           1       1

Then, you can split the feature matrix into a training and testing set.

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(fm, test_size=.2, shuffle=False)
y_train, y_test = X_train.pop('target'), X_test.pop('target')

For AutoML, you can use EvalML to find the best ML pipeline and graph a confusion matrix.

from evalml import AutoMLSearch
from evalml.model_understanding.graphs import graph_confusion_matrix

automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type='binary',
    allowed_model_families=['random_forest'],
)
automl.search()
y_pred = automl.best_pipeline.predict(X_test)
graph_confusion_matrix(y_test, y_pred).show()

You can find similar machine learning examples in the linked page. Let me know if you found this helpful.

Jeff Hernandez
  • OK, thanks, I will try this, but I have one question. You used date_info for both time_index and the cutoff time. Shouldn't they be different? Maybe cut_off_time = date_info - 90 days? – mcsahin Aug 03 '21 at 11:11
  • I implemented your example code and this is what I got: TN: 72476, FP: 0, FN: 86, TP: 1893; precision: 100%, recall: 96%. This is obviously overfitted, but I can fix it. Now I should test this with my second approach: I need to create the train and test sets in Java, merge them, use featuretools to create features, separate them, fit the model, and predict. I hope I can get a good model. – mcsahin Aug 03 '21 at 13:43
  • You can use the time index `date_info` as the cutoff time. This approach ensures that you are building features only with data that exists up to the cutoff time. Each label can have a corresponding cutoff time so that there isn't any data leakage in model training. – Jeff Hernandez Aug 03 '21 at 16:31
  • If you want to exclude data at the cutoff time, you can use `include_cutoff_time=False` in the DFS call. You can find more details about [handling time](https://featuretools.alteryx.com/en/stable/getting_started/handling_time.html#excluding-data-at-cutoff-times) on the linked page. – Jeff Hernandez Aug 03 '21 at 16:38
  • I think I understand my problem. This line, `label = df['target','test_data']`, and these two lines, `fm = fm.set_index(label.index)` and `fm['target','test_data'] = label`, are causing the problem. The feature matrix (fm) and `label` are not aligned correctly when they are merged. I tested this on a small dataset and saw that a row that should be labeled class 1 was labeled class 0 after the merge. This mistake causes the model to fail. Thanks for the help @Jeff. I will try to fix the issue and update this post. I am still not clear about cutoff_time, so I may have more questions. – mcsahin Aug 04 '21 at 12:57
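
The misalignment described in the last comment can be reproduced on a toy frame with pandas alone. This sketch assumes that the feature matrix comes back in a different row order than the input (here, sorted by date). Whether that is exactly what happened in the question is not confirmed, and all data below is made up.

```python
import pandas as pd

# Original rows, indexed 0..3; only the 2021-01-01 row belongs to class 1.
df = pd.DataFrame({
    "date_info": pd.to_datetime(
        ["2021-03-01", "2021-01-01", "2021-04-01", "2021-02-01"]
    ),
    "target": [0, 1, 0, 0],
})
label = df["target"]

# Hypothetical feature matrix: same rows, reordered by time, original index kept.
fm = df[["date_info"]].sort_values("date_info")

# Pattern from the question: overwrite the index positionally, then assign.
# The reordered rows receive index values that belong to other rows.
bad = fm.set_index(label.index)
bad["target"] = label

# Index-preserving alternative: let pandas align on the existing index.
good = fm.join(label)

print(bad)
print(good)
```

In `bad`, the 2021-01-01 row ends up labeled 0 and the 2021-02-01 row ends up labeled 1. In `good`, each row keeps its own label. Passing the labels through `cutoff_time`, as in the answer, avoids the manual merge altogether.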