I have multi-class text data that I want to oversample with SMOTE because some labels are in the minority. I already did this, but I'm getting a sparse matrix as my output.

Is there a way to get the text data back after SMOTE?

Here is my code sample:

X_train = df['transcript']
y_train = df['label']
from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state = 2) 
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
Venkatachalam
Eniola
  • Can you post your complete code? – Venkatachalam Jul 17 '20 at 14:11
  • Thanks, I've adjusted my code. The data is in a DataFrame with two columns: label and transcript. I've already done the cleaning, but some labels are in the minority, which is why I need to apply SMOTE. Aside from the cleaning, I haven't done anything else. – Eniola Jul 17 '20 at 14:15

2 Answers

SMOTE.fit_sample uses label_binarize from Scikit-learn internally: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/12b2e0d/imblearn/base.py#L87

You should manually use sklearn.preprocessing.LabelBinarizer on the y values before applying SMOTE.

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelBinarizer

sm = SMOTE(random_state = 2)
lb = LabelBinarizer()
y_train_bin = lb.fit_transform(y_train)
X_train_res, y_train_res_bin = sm.fit_sample(X_train, y_train_bin)

Then you can recover the text labels with the inverse_transform method of the fitted LabelBinarizer:

y_train_res = lb.inverse_transform(y_train_res_bin)
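
As a small illustration of the label round trip only (it does not involve SMOTE or the text column), the sketch below binarizes a few made-up labels and recovers them again. The label values are assumptions chosen for the example.

```python
from sklearn.preprocessing import LabelBinarizer

# Made-up labels for illustration; not taken from the asker's data.
labels = ['good', 'bad', 'neutral', 'good']

lb = LabelBinarizer()
y_bin = lb.fit_transform(labels)         # one indicator column per class
recovered = lb.inverse_transform(y_bin)  # back to the original strings

print(lb.classes_)
print(y_bin)
print(list(recovered))
```

With more than two classes, fit_transform returns one column per class, so y_bin here is a 2-D indicator matrix rather than a single vector.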
shadowtalker
  • I've tried this before too, but I was getting an error that says: ValueError: could not convert string to float: – Eniola Jul 17 '20 at 14:20
  • This line: 'X_train_res, y_train_res_bin = sm.fit_sample(X_train, y_train_bin)' – Eniola Jul 17 '20 at 14:27
2

Actually, SMOTE expects X to contain numerical data only. The labels are not the problem; they can be strings.

Read here to understand how SMOTE works internally. Basically, it creates a synthetic data point for the minority class as a convex combination of a minority sample and one of its chosen neighbours.
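
The interpolation step can be sketched roughly as follows. This is a simplified illustration of the idea, not imbalanced-learn's actual implementation; the function name smote_like_sample and the two example vectors are assumptions made up for the sketch.

```python
import numpy as np

def smote_like_sample(x, neighbour, rng):
    """Return a random point on the segment between x and neighbour.

    Hypothetical helper: it only mimics the interpolation idea behind SMOTE.
    """
    lam = rng.uniform(0.0, 1.0)  # interpolation weight in [0, 1)
    return x + lam * (neighbour - x)

rng = np.random.default_rng(2)
x = np.array([0.0, 1.0, 0.5])          # a minority-class sample (made up)
neighbour = np.array([1.0, 1.0, 0.0])  # one of its nearest neighbours (made up)

synthetic = smote_like_sample(x, neighbour, rng)
print(synthetic)
```

This arithmetic is why X has to be numeric: there is no meaningful way to take a weighted combination of two raw strings.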

So, convert your text data (transcripts) into numerical features using TfidfVectorizer or CountVectorizer. You can use the inverse_transform method of these vectorizers to get the terms back, but you will lose the word order.

import pandas as pd
df = pd.DataFrame({'transcripts': ['I want to check this',
                                    'how about one more sentence',
                                    'hopefully this works well fr you',
                                    'I want to check this',
                                    'This is the last sentence or transcript'],
                    'labels': ['good','bad', 'bad', 'good','bad']})
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(df['transcripts'])

from imblearn.over_sampling import SMOTE 
sm = SMOTE(k_neighbors=1, random_state = 2) 
# fit_resample is the current method name; older imbalanced-learn releases called it fit_sample
X_train_res, y_train_res = sm.fit_resample(X, df.labels)


vec.inverse_transform(X_train_res)
# [array(['this', 'check', 'to', 'want'], dtype='<U10'),
#  array(['sentence', 'more', 'one', 'about', 'how'], dtype='<U10'),
#  array(['you', 'fr', 'well', 'works', 'hopefully', 'this'], dtype='<U10'),
#  array(['this', 'check', 'to', 'want'], dtype='<U10'),
#  array(['transcript', 'or', 'last', 'the', 'is', 'sentence', 'this'],
#        dtype='<U10'),
#  array(['want', 'to', 'check', 'this'], dtype='<U10')]
Venkatachalam
  • Thanks very much for this... this works. I want to have the `vec.inverse_transform(X_train_res)` output in a DataFrame with the respective labels after transforming. Is that possible? – Eniola Jul 17 '20 at 16:54
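
One possible way to do what the last comment asks is sketched below. It joins the terms recovered for each row into a single string and pairs them with the labels. The helper name resampled_to_frame is hypothetical, and to keep the sketch independent of the resampling step it is demonstrated on the vectorized matrix itself rather than on SMOTE output.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def resampled_to_frame(vec, X_res, y_res):
    """Pair each row's recovered terms with its label (hypothetical helper)."""
    terms_per_row = vec.inverse_transform(X_res)  # one array of terms per row
    return pd.DataFrame({
        'transcripts': [' '.join(terms) for terms in terms_per_row],
        'labels': list(y_res),
    })

texts = ['I want to check this', 'how about one more sentence']
labels = ['good', 'bad']

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# With the answer's code, pass X_train_res and y_train_res here instead.
df_res = resampled_to_frame(vec, X, labels)
print(df_res)
```

The recovered strings are bags of terms, not the original sentences: word order is lost, and tokens the vectorizer dropped (such as single-character words under the default settings) do not come back.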