
I have an imbalanced dataset, as shown below:

id text               category
1  comment1               0 
2  comment2               0 
3  comment3               1 
4  comment4               0 

I have pre-processed the "text" by removing numeric values and applying stemming.

Next, I split my data into training and test sets for validation.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data['text'], data['category'])

Then I apply down-sampling to my training set:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(return_indices=True)

train_X_resampled, train_y_resampled, idx_resampled = rus.sample(X_train, y_train)

However, I get the error below. How can I fix it?

ValueError: could not convert string to float: 'comment2'
aylr
CHONG
  • You cannot use text data. You need to replace it entirely using other techniques. – Vivek Kumar Mar 11 '18 at 01:54
  • What if I apply down-sampling before splitting the data into training and test sets for validation? I know that SMOTE and other synthetic sampling methods are prone to data leakage. However, since I'm using down-sampling, where no replacement is needed for my data, is it possible to down-sample first and only then do the train-test split? The error only appears after I apply sampling to my training data. – CHONG Mar 11 '18 at 02:11
  • It doesn't matter. The libraries you are using don't support text data. You need to replace that. – Vivek Kumar Mar 11 '18 at 02:13
  • I see. What if I switch to another library for sampling (e.g. sklearn.utils.resample)? I tried another way, down-sampling first and only then applying the train-test split, and had no issues using sklearn.utils.resample. However, I'm concerned that applying down-sampling after the train-test split is the wrong way to do it. – CHONG Mar 11 '18 at 02:22
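
To make the ordering discussed in these comments concrete, here is a minimal sketch of down-sampling only the training portion after the split. It samples row positions per class, so the text itself never has to be numeric. It uses only the standard library; `downsample_indices` is a hypothetical helper written for illustration, not an imblearn or scikit-learn function, and the toy lists are made-up stand-ins for X_train and y_train.

```python
import random
from collections import defaultdict


def downsample_indices(labels, seed=0):
    """Return sorted row positions that cut every class down to the minority-class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pos, label in enumerate(labels):
        by_class[label].append(pos)
    n_min = min(len(positions) for positions in by_class.values())
    kept = []
    for positions in by_class.values():
        # Sampling without replacement: no row is duplicated or synthesised.
        kept.extend(rng.sample(positions, n_min))
    return sorted(kept)


# Toy training split (made-up values standing in for X_train / y_train).
train_text = ["comment1", "comment2", "comment3", "comment4", "comment5", "comment6"]
train_label = [0, 0, 1, 0, 0, 1]

keep = downsample_indices(train_label)
train_text_resampled = [train_text[i] for i in keep]
train_label_resampled = [train_label[i] for i in keep]
```

In this sketch the test split is left untouched, so evaluation still reflects the original class distribution.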

1 Answer


imblearn doesn't support DataFrames. Convert your column(s) of interest to a list and reshape it into a 2D array using np.array(list(data['text'])).reshape(-1, 1), and it should work.

A_Matar