How can I apply SMOTE to multiclass text data?

Question

I have a multiclass dataset that I want to balance with SMOTE, but I am getting this error:

ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict.

I want to balance my data using SMOTE or any other technique that works for multiple classes. My raw data is text.

Added for context: the file my_data has 5 categories: car, bike, bicycle, bus and pedestrian.
There are 56 records in the car category, 18 in the bus category, 11 in the bike category, 10 in the bicycle category and 5 in the pedestrian category.

Please [edit] your question and add the details. Do not use comments to clarify. Code in comments is unreadable, and comments may or may not be shown initially. Make it easy for people to help you. Make sure your code is a [mcve]. What happens when you run your code? What do you expect to happen instead? Include any errors you may get. You may also want to read [ask]. — Robert, May 13 '19 at 20:19
I'm in the same situation as you. Did you solve it ? If so, how? Thanks! — BringBackCommodore64, Jul 18 '19 at 16:22
This may give you ideas: https://machinelearningmastery.com/multi-class-imbalanced-classification/ — belwood, Nov 27 '20 at 06:44

score 0 · Answer 1 · answered May 15 '19 at 20:28

It depends on the type of the value you pass as sampling_strategy. The official imbalanced-learn documentation lists all the accepted values.
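For the multiclass case in the question, the dict form named in the error message maps each class label to the number of samples wanted after resampling. A minimal sketch, assuming the class counts from the question; the helper names minority_targets and oversample are hypothetical, and X is assumed to be numeric already, since SMOTE cannot interpolate raw strings:

```python
from collections import Counter


def minority_targets(labels):
    """Map every minority class to the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target for cls, n in counts.items() if n < target}


# Class counts taken from the question.
labels = (["car"] * 56 + ["bus"] * 18 + ["bike"] * 11
          + ["bicycle"] * 10 + ["pedestrian"] * 5)
strategy = minority_targets(labels)
print(strategy)


def oversample(X, y, strategy):
    """Hypothetical wrapper; X must be numeric (for example TF-IDF vectors)."""
    # Imported here so the dict-building part runs without imbalanced-learn.
    from imblearn.over_sampling import SMOTE

    # k_neighbors must be smaller than the smallest class (5 pedestrians here).
    smote = SMOTE(sampling_strategy=strategy, k_neighbors=4, random_state=42)
    return smote.fit_resample(X, y)
```

Leaving sampling_strategy at its default of 'auto' should give the same result here, because for over-samplers it resamples every class except the majority. The float form is the only one restricted to binary targets.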

andr3s2

So, do I need to featurize the text data and then split it into train and test sets? – ANURAG SINGH May 16 '19 at 07:10

score 0 · Answer 2 · answered May 09 '23 at 06:01

If you are working with Python and pandas, first read your CSV file into a DataFrame.

import pandas as pd
df = pd.read_csv('my_data.csv')
# get a count of the target class
df['target'].value_counts()

Since car has 56 records and the other categories have far fewer, you can upsample the other four categories (bike, bicycle, bus, pedestrian) to match the number of records in the car category. First, extract the records of the car category:

car  = df[(df.target == "car") ]
car

Then extract the records of the bike category:

bike  = df[(df.target == "bike") ]
bike

# merge the car and bike dataframes (DataFrame.append was removed in pandas 2.0)
bike_and_car = pd.concat([car, bike], ignore_index=True)
# encode the two labels as integers: car -> 1, bike -> 0
bike_and_car['target'] = bike_and_car['target'].map({'car': 1, 'bike': 0})
bike_and_car

# separate the features from the label
X = bike_and_car.drop(['target'], axis=1)
y = bike_and_car['target']

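Now import a SMOTE variant such as SMOTENC or SVMSMOTE from the imblearn library. SMOTENC has to be told which columns are categorical:

from imblearn.over_sampling import SMOTENC
# categorical_features is required; [0] is a placeholder, pass the indices of your own categorical columns
# (SMOTENC also expects at least one numeric column in X)
smote_nc = SMOTENC(categorical_features=[0], random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
X_res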

X_res should now contain the original records plus the synthetic records generated for the bike category. Repeat the same procedure for the other categories: bus, bicycle and pedestrian.
You can save the resampled data, together with its labels, as a CSV file:

X_res.assign(target=y_res).to_csv(r'C:\Users\Moon\Documents\my_data_with_synthetic_data.csv', index=False)
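Repeating the steps per category can be avoided by handling all classes in one pass. The sketch below is not SMOTE: it is plain random oversampling, which duplicates existing rows with replacement and therefore keeps the raw text readable. It assumes a label column named target as above; the text column and the random_oversample helper are hypothetical:

```python
import pandas as pd


def random_oversample(df, label_col="target", seed=42):
    """Duplicate rows of each minority class until it matches the largest class."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = [df]  # keep every original row
    for cls, n in counts.items():
        if n < target:
            extra = df[df[label_col] == cls].sample(
                n=int(target - n), replace=True, random_state=seed)
            parts.append(extra)
    return pd.concat(parts, ignore_index=True)


# Toy frame with the class counts from the question; the "text" column is made up.
sizes = {"car": 56, "bus": 18, "bike": 11, "bicycle": 10, "pedestrian": 5}
rows = [(f"{cls} record {i}", cls) for cls, n in sizes.items() for i in range(n)]
df = pd.DataFrame(rows, columns=["text", "target"])

balanced = random_oversample(df)
print(balanced["target"].value_counts())
```

Because the added rows are exact copies, this should be applied to the training split only, otherwise duplicates of the same record can end up on both sides of the split.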

score 0 · Answer 3 · answered May 15 '23 at 15:17

If using SMOTE is not mandatory, I'd try CTGAN, which handles categorical features and can generate new synthetic data without the need to manipulate the data first. You can explore a solution with ydata-synthetic, for instance, and if you have a .csv file, you can simply try it out with the Streamlit app, which is very intuitive.

SeaEngineering

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community May 15 '23 at 23:37
