Python: ValueError: could not convert string to float when applying down-sampling

Question

I have an imbalanced dataset, as shown below:

id text               category
1  comment1               0 
2  comment2               0 
3  comment3               1 
4  comment4               0

I have pre-processed the "text" by removing numeric values and applying stemming.

Next, I split my data into training and testing sets for validation.

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['category'])

Then I apply down-sampling to my training set:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(return_indices=True)

train_X_resampled, train_y_resampled, idx_resampled = rus.sample(X_train, y_train)

However, I get the error below. How can I fix it?

ValueError: could not convert string to float: 'comment2'

You cannot use text data. You need to completely replace it by using different techniques. — Vivek Kumar, Mar 11 '18 at 01:54
How about applying down-sampling first, before splitting the data into training and testing sets for validation? I know that SMOTE and other synthetic sampling methods are prone to data leakage. However, since I'm using down-sampling, where no replacement is needed for my data, is it possible to apply down-sampling first and only then do the train-test split? The error only comes up after I apply sampling to my training data. — CHONG, Mar 11 '18 at 02:11
It doesn't matter. The libraries you are using don't support text data. You need to replace that. — Vivek Kumar, Mar 11 '18 at 02:13
I see. What if I switch to another library for sampling (e.g. sklearn.utils.resample)? I tried another way, down-sampling first and only then applying the train-test split, and had no issues using sklearn.utils.resample. However, I'm just concerned that applying down-sampling after the train-test split is the wrong way. — CHONG, Mar 11 '18 at 02:22

score 0 · Answer 1 · answered Jun 04 '18 at 09:42

imblearn doesn't support dataframes. Convert your column(s) of interest to a list and reshape it into a 2D array using np.array(list(data['text'])).reshape(-1, 1), and it should work.
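
If reshaping is not enough in your version of imblearn, another option (not the approach above, just a minimal sketch) is to under-sample row positions yourself and then select the text with them, so the strings are never passed to a sampler. The helper name undersample_indices and the toy lists are hypothetical and only mirror the columns in the question; the sketch uses the standard library only.

```python
import random
from collections import defaultdict


def undersample_indices(labels, seed=0):
    """Return sorted row positions that keep an equal number of rows per class.

    Hypothetical helper for illustration; it is not part of imblearn or sklearn.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)
    # Every class is cut down to the size of the rarest class.
    n_keep = min(len(positions) for positions in by_class.values())
    kept = []
    for positions in by_class.values():
        # Sampling without replacement, so no row is duplicated.
        kept.extend(rng.sample(positions, n_keep))
    return sorted(kept)


# Toy training split mirroring the question's columns (assumed data).
train_text = ["comment1", "comment2", "comment3", "comment4"]
train_category = [0, 0, 1, 0]

idx_resampled = undersample_indices(train_category)
train_X_resampled = [train_text[i] for i in idx_resampled]
train_y_resampled = [train_category[i] for i in idx_resampled]
```

With pandas objects, the same positions can be used as X_train.iloc[idx_resampled] and y_train.iloc[idx_resampled]. Running this on the training split only keeps the test set at its original class distribution.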

A_Matar
