Python SVM Classifier - issues with input NaNs and data shape

Question

I am trying to build a binary SVM classifier with ECG data to diagnose sleep apnea. For roughly 16,000 inputs I perform a wavelet transform, manually extract HRV features, store them in a feature list, and feed this list into the classifier.

This worked fine with the raw data, before I added the wavelet transform as a preprocessing step. After the transform, some values in the feature list became NaN, so this line of code raised an error:

clf.fit(X_train, y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

so I executed this step:

x = pd.DataFrame(data=X_train)
x = x[~x.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

which solved the ValueError but removing the 'faulty' inputs meant the shapes of X_train and y_train don't match up:

clf.fit(x, y_train)

#error
Found input variables with inconsistent numbers of samples: [11255, 11627]

I am struggling to figure out how to remove the corresponding values from y_train so that the samples match up. Or is there a better approach to this?

Please let me know if you need more info on the code.

It looks like you've resized your `X_Train` but not `Y_Train`. Try adding this: `Y_Train=Y_Train[~x.isin([np.nan, np.inf, -np.inf]).any(1)]` — gnodab, May 01 '20 at 13:36
Hi thank you for your response! But I got the same error for this code: `a = pd.DataFrame(data=X_train) b = pd.DataFrame(data=y_train) a=a[~a.isin([np.nan, np.inf, -np.inf]).any(1)] b=b[~b.isin([np.nan, np.inf, -np.inf]).any(1)]` — Sakshi Kumar, May 01 '20 at 15:23
Could you return the shape of `X_train` and `y_train`, please? — Anwarvic, May 01 '20 at 15:33
@SakshiKumar In the snippet in your comment you are checking for `NaN` in `b` (i.e. `y_train`). Don't do that; build the mask from `a` (i.e. `X_train`) before filtering it, then apply that mask to both. Like so: `a = pd.DataFrame(data=X_train) b = pd.DataFrame(data=y_train) mask = ~a.isin([np.nan, np.inf, -np.inf]).any(axis=1) a = a[mask] b = b[mask]` — gnodab, May 01 '20 at 15:55

score 2 · Answer 1 · answered May 01 '20 at 16:28

Without sample data, this is impossible to test, but you are checking for valid data in the X_train dataframe, which is good. Now you just need to remove the corresponding y_train labels. Something like this:

x = pd.DataFrame(data=X_train)
# True for rows of X_train that contain no NaN or +/-inf
valid_indexes = ~x.isin([np.nan, np.inf, -np.inf]).any(axis=1)
x = x[valid_indexes]

# Apply the same mask to the labels so both keep the same rows
y_train = np.asarray(y_train)[valid_indexes.to_numpy()]

Make sure you always build the mask from the X_train data and compute it before filtering, so that it still has one entry per original sample. I am presuming here that all of the labels are valid.
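For anyone who wants something they can run without the original ECG features, here is a minimal sketch of the same idea using NumPy only. The helper name `drop_non_finite_rows` and the toy arrays are made up for illustration and are not from the question. It also assumes `X` is a 2-D numeric array with one row per sample.

```python
import numpy as np


def drop_non_finite_rows(X, y):
    """Return X and y without the rows where X contains NaN or +/-inf.

    Hypothetical helper for illustration; assumes X is 2-D (samples x features).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of samples")
    # One boolean per sample: True when every feature in that row is finite
    valid = np.isfinite(X).all(axis=1)
    # Index both arrays with the same mask so features and labels stay aligned
    return X[valid], y[valid]


# Toy data (invented): the second row has a NaN, the fourth has an inf
X_train = np.array([[0.1, 0.2], [np.nan, 0.5], [0.3, 0.4], [0.7, np.inf]])
y_train = np.array([0, 1, 1, 0])

X_clean, y_clean = drop_non_finite_rows(X_train, y_train)
print(X_clean.shape, y_clean.shape)
```

The key point is that the mask is computed once, from the unfiltered features, and then used to index both arrays. Filtering `X_train` first and recomputing the mask afterwards would give a mask of the wrong length.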

This worked perfectly and I applied the same to the test samples. Thank you so much!! — Sakshi Kumar, May 03 '20 at 23:50
