I am having some issues with the correct order of steps for over-sampling a dataset. This is what I have done:
# Separate input features and target (Label is deliberately kept in X here,
# so the training split can be filtered by class below)
y_up = df.Label
X_up = df.drop(columns=['Date', 'Links', 'Paths'])
# setting up testing and training sets
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)
class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]
# upsample minority
class_1_upsampled = resample(class_1,
                             replace=True,
                             n_samples=len(class_0),
                             random_state=27)
# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])
Since my dataset looks like:
Label Text
1 bla bla bla
0 once upon a time
1 some other sentences
1 a few sentences more
1 this is my dataset!
I applied a vectorizer to transform the strings into numbers:
X_train_up = upsampled[['Text']]
y_train_up = upsampled['Label']  # 1-D target, as scikit-learn expects
X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.nan, "")).todense(),
                          index=X_train_up.index)
Then I applied the logistic regression function:
upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)
However, I got the following error at this step:
X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
pred_up_log = upsampled_log.predict(X_test_up)
ValueError: X has 3021 features per sample; expecting 5542
Since I was told that the over-sampling should be applied after splitting the dataset into train and test, I have not vectorised the test set. My doubts are then the following:
- is it right to vectorise the test set only afterwards, like this:
X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
- is it right to apply the over-sampling after splitting the dataset into training and test sets?
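From what I have read, I suspect the mismatch in feature counts comes from calling fit_transform on the test set, which rebuilds the vocabulary. My understanding is that the vectorizer should be fitted on the training text only and merely applied (transform) to the test text, roughly like the sketch below (toy data, illustrative names) — is that the correct pattern?

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the real training/test text columns
train_text = pd.Series(["bla bla bla", "once upon a time"])
test_text = pd.Series(["some other sentences"])

vectorizer = CountVectorizer()

# Fit the vocabulary on the training text only ...
X_train = vectorizer.fit_transform(train_text.replace(np.nan, ""))
# ... then reuse it (transform, NOT fit_transform) on the test text,
# so both matrices end up with the same number of columns
X_test = vectorizer.transform(test_text.replace(np.nan, ""))

print(X_train.shape[1] == X_test.shape[1])
```

If this is right, it would explain the "X has 3021 features per sample; expecting 5542" error above.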
Alternatively, I tried with the SMOTE function. The code below works, but if possible I would still prefer plain random over-sampling rather than SMOTE.
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(df['Text'], df['Label'], test_size=0.2, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train_tfidf, y_train_up)  # fit_sample in older imblearn versions
print("Shape after SMOTE is:", X_train_res.shape, y_train_res.shape)
nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
# apply the same count + tf-idf transforms fitted on the training set
y_pred = nb.predict(tfidf_transformer.transform(count_vect.transform(X_test_up)))
print(accuracy_score(y_test_up, y_pred))
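What I am ultimately aiming for is the same end-to-end flow but with plain random over-sampling instead of SMOTE: split first, vectorise on the training text only, then over-sample only the vectorised training set. My understanding is that it would look roughly like this sketch (toy data, illustrative names; class 1 is the minority here) — is this the right shape?

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-ins for df['Text'] / df['Label']; class 1 is the minority
texts = np.array(["bla bla bla", "once upon a time", "some other sentences",
                  "a few sentences more", "this is my dataset", "more text here",
                  "yet another line", "and one more line"])
labels = np.array([0, 1, 0, 0, 1, 0, 0, 0])

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=27, stratify=labels)

# 1) vectorise: fit on the training text only, reuse on the test text
vec = TfidfVectorizer()
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

# 2) random over-sampling of the minority class, done after the split
#    and on the training set only
is_min = y_train == 1
X_min_up, y_min_up = resample(X_train_vec[is_min], y_train[is_min],
                              replace=True, n_samples=int((~is_min).sum()),
                              random_state=27)
X_bal = vstack([X_train_vec[~is_min], X_min_up])
y_bal = np.concatenate([y_train[~is_min], y_min_up])

# 3) fit and evaluate; the test set is never over-sampled or re-fitted
clf = LogisticRegression(solver='liblinear').fit(X_bal, y_bal)
pred = clf.predict(X_test_vec)
```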
Any comments and suggestions would be appreciated. Thanks!