I am having some issues with the steps to follow for over-sampling a dataset. What I have done is the following:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Separate input features and target
y_up = df.Label

X_up = df.drop(columns=['Date','Links', 'Paths'], axis=1)

# setting up testing and training sets

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)

# separate majority (0) and minority (1) classes in the training set
class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]


# upsample minority
class_1_upsampled = resample(class_1,
                          replace=True, 
                          n_samples=len(class_0), 
                          random_state=27)

# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])

Since my dataset looks like:

Label     Text 
1        bla bla bla
0        once upon a time 
1        some other sentences
1        a few sentences more
1        this is my dataset!

I applied a vectorizer to transform the strings into numbers:

X_train_up=upsampled[['Text']]
y_train_up=upsampled[['Label']]

X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.NaN, "")).todense(), index=X_train_up.index)

Then I applied the logistic regression function:

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)

However, I got the following error at this step:

X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)

pred_up_log = upsampled_log.predict(X_test_up)

ValueError: X has 3021 features per sample; expecting 5542

Since I was told that I should apply the over-sampling after splitting my dataset into train and test, I had not vectorised the test set beforehand. My doubts are then the following:

  • is it right to vectorise the test set afterwards, as in X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index), or should it reuse the vectorizer already fitted on the training text (see the sketch after this list)?
  • is it right to apply the over-sampling after splitting the dataset into training and test?
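
For reference, this is the test-set transformation I have in mind for the first point (only a sketch; it assumes vectorizer is the same instance already fitted on the training text above):

# reuse the vocabulary learned on the training text instead of refitting it
X_test_vec = pd.DataFrame(
    vectorizer.transform(X_test_up['Text'].replace(np.NaN, "")).todense(),
    index=X_test_up.index)

pred_up_log = upsampled_log.predict(X_test_vec)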

Alternatively, I tried with SMOTE. The code below works, but, if possible, I would prefer plain over-sampling rather than SMOTE.

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df['Text'],df['Label'], test_size=0.2,random_state=42)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


sm = SMOTE(random_state=2)
# fit_sample is the older imblearn name; newer versions call it fit_resample
X_train_res, y_train_res = sm.fit_sample(X_train_tfidf, y_train_up)
print("Shape after smote is:",X_train_res.shape,y_train_res.shape)

nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
# apply the same count and tf-idf transforms that were used for training
y_pred = nb.predict(tfidf_transformer.transform(count_vect.transform(X_test_up)))
print(accuracy_score(y_test_up,y_pred))

Any comments and suggestions will be appreciated. Thanks

  • you need to do vectorizer.fit_transform() on the whole dataset, otherwise there will be features present in your train and not in your test, and vice versa – StupidWolf Nov 30 '20 at 14:49
  • You can fill in the missing columns but it will be super messy – StupidWolf Nov 30 '20 at 14:50
  • thanks @StupidWolf. How could I apply it to the whole dataset? Can I pass it before splitting into train and test? – V_sqrt Nov 30 '20 at 15:07
  • OK, I see the issue now. You need to upsample. I would do the vectorizing first, and upsample in the train. You don't need to convert to a dense array. I'll see whether I can write an answer. – StupidWolf Nov 30 '20 at 15:07
  • Thanks a lot. It would be great. – V_sqrt Nov 30 '20 at 15:11
  • You can and should use `vectorizer.transform` instead of `fit_transform` for the test set. And there's probably no reason to cast things to dense arrays nor dataframes. – Ben Reiniger Nov 30 '20 at 22:46

1 Answer

It is better to do the count vectorizing and tf-idf transformation on the whole dataset, then split into train and test, and keep the result as a sparse matrix without converting it back into a DataFrame.

For example, this is a dataset:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Text':['This is bill','This is mac','here’s an old saying',
                           'at least old','data scientist years','data science is data wrangling', 
                           'This rings particularly','true for data science leaders',
                           'who watch their data','scientists spend days',
                           'painstakingly picking apart','ossified corporate datasets',
                           'arcane Excel spreadsheets','Does data science really',
                           'they just delegate the job','Data Is More Than Just Numbers',
                           'The reason that',
                           'data wrangling is so difficult','data is more than text and numbers'],
                   'Label':[0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0]})

We perform the vectorization and transformation, followed by split:

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(df_counts)

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_tfidf,df['Label'].values, 
                                                              test_size=0.2,random_state=42)

Upsampling can be done by resampling the indices of the minority class:

class_0 = np.where(y_train_up==0)[0]   # indices of the majority class
class_1 = np.where(y_train_up==1)[0]   # indices of the minority class

# keep every majority index and draw minority indices with replacement
up_idx = np.concatenate((class_0,
                        np.random.choice(class_1,len(class_0),replace=True)
                       ))

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up[up_idx,:], y_train_up[up_idx])

And the prediction will work:

upsampled_log.predict(X_test_up)
array([0, 1, 0, 0])
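
If you also want a score against the held-out labels, as in the SMOTE example from the question, you can compare the prediction with y_test_up (a small sketch, assuming accuracy_score from sklearn.metrics):

from sklearn.metrics import accuracy_score

# score the upsampled model on the untouched test split
pred_up = upsampled_log.predict(X_test_up)
print(accuracy_score(y_test_up, pred_up))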

You may have concerns about data leakage, that is, some information from the test set going into the training through the use of TfidfTransformer(). Honestly, I have yet to see concrete proof or a demonstration of this, but below is an alternative where the tf-idf transformer is fitted on the training set only:

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_counts,df['Label'].values, 
                                                              test_size=0.2,random_state=42)

class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                        np.random.choice(class_1,len(class_0),replace=True)
                       ))

tfidf_transformer = TfidfTransformer()
upsample_Xtrain = tfidf_transformer.fit_transform(X_train_up[up_idx,:])
upsample_y = y_train_up[up_idx]

upsampled_log = LogisticRegression(solver='liblinear').fit(upsample_Xtrain, upsample_y)

X_test_up = tfidf_transformer.transform(X_test_up)
upsampled_log.predict(X_test_up)
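
If you prefer a library helper over building up_idx by hand, the same random duplication of the minority class can be done with RandomOverSampler from imbalanced-learn (the package already used for SMOTE in the question). This is only a sketch; it uses fit_resample, which older imbalanced-learn versions call fit_sample:

from imblearn.over_sampling import RandomOverSampler

# randomly duplicates minority-class rows until both classes have the same count,
# equivalent to the manual up_idx construction above
ros = RandomOverSampler(random_state=27)
X_train_os, y_train_os = ros.fit_resample(X_train_up, y_train_up)

The resampled X_train_os and y_train_os can then go through the TfidfTransformer and LogisticRegression steps exactly as above.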
  • This approach may lead to some data leakage. See e.g. https://stats.stackexchange.com/q/154660/232706 – Ben Reiniger Nov 30 '20 at 22:46
  • if you do the idf part, yes some of it might. Ok I can edit the answer. Please be a bit less vague and point explicitly to the parts of your link – StupidWolf Nov 30 '20 at 23:46
  • @StupidWolf Hi, I have found this question and read your answer. I have a similar problem with overfitting. I do not know if you can be interested in that, but in case you want to have a look, please see here the link: https://stackoverflow.com/questions/65191701/convert-text-encoding-in-ml-classifier . Many thanks – LdM Dec 09 '20 at 00:58