I am trying to build a binary SVM classifier on ECG data to diagnose sleep apnea. For each of roughly 16,000 inputs I apply a wavelet transform, manually extract HRV features, store them in a feature list, and feed this list into the classifier.

This worked fine on the raw data, but after I added the wavelet transform preprocessing step some values in the feature list became NaN, so this line of code raised an error:

clf.fit(X_train, y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

so I executed this step:

x = pd.DataFrame(data=X_train)
x = x[~x.isin([np.nan, np.inf, -np.inf]).any(axis=1)]

This solved the ValueError, but removing the 'faulty' inputs means the number of samples in x and y_train no longer match:

clf.fit(x, y_train)

#error
Found input variables with inconsistent numbers of samples: [11255, 11627]

How can I remove the corresponding values from y_train so that the samples match up? Or is there a better approach to this?

Please let me know if you need more info on the code.

  • It looks like you've resized your `X_Train` but not `Y_Train`. Try adding this: `Y_Train=Y_Train[~x.isin([np.nan, np.inf, -np.inf]).any(1)]` – gnodab May 01 '20 at 13:36
  • Hi thank you for your response! But I got the same error for this code: `a = pd.DataFrame(data=X_train) b = pd.DataFrame(data=y_train) a=a[~a.isin([np.nan, np.inf, -np.inf]).any(1)] b=b[~b.isin([np.nan, np.inf, -np.inf]).any(1)]` – Sakshi Kumar May 01 '20 at 15:23
  • Could you return the shape of `X_train` and `y_train`, please? – Anwarvic May 01 '20 at 15:33
  • x.shape = (11276, 9) y.shape = (11627, 1) – Sakshi Kumar May 01 '20 at 15:49
  • @SakshiKumar In the snippet in the comments you are checking for `NaN` in `b` (or `y_train`). Don't do that; instead check for `NaN` in `a` (or `X_train`) and apply that same mask to both, computing it before `a` is filtered. Like so: `a = pd.DataFrame(data=X_train) b = pd.DataFrame(data=y_train) mask = ~a.isin([np.nan, np.inf, -np.inf]).any(axis=1) a = a[mask] b = b[mask]` – gnodab May 01 '20 at 15:55

1 Answer


Without sample data I can't test this, but you are already checking for valid rows in the X_train DataFrame, which is good. Now you just need to remove the corresponding y_train labels using the same mask. Something like this:

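import numpy as np
import pandas as pd

x = pd.DataFrame(data=X_train)
# True for rows that contain no NaN or infinite values
valid_indexes = ~x.isin([np.nan, np.inf, -np.inf]).any(axis=1)
x = x[valid_indexes]

# Apply the same row mask to the labels so x and y_train stay aligned
y_train = np.asarray(y_train)[valid_indexes.to_numpy()]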

Make sure you always compute the validity mask from the X_train data, and compute it before filtering so that it still has one entry per original row. I am presuming here that all of the labels themselves are valid.
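As a self-contained illustration of the same idea, here is a minimal sketch that works directly on NumPy arrays. The helper name `drop_non_finite_rows` and the toy arrays are made up for this example and are not from the question; it assumes the features are numeric and that `X` is two-dimensional with one row per sample.

```python
import numpy as np

def drop_non_finite_rows(X, y):
    """Return X and y with rows removed wherever X contains NaN or +/-inf."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # np.isfinite is False for NaN, inf and -inf; keep rows where every feature is finite
    mask = np.isfinite(X).all(axis=1)
    # Index both arrays with the same mask so samples and labels stay aligned
    return X[mask], y[mask]

# Toy data standing in for the HRV feature list (values are invented)
X_demo = np.array([[0.1, 0.2], [np.nan, 0.5], [0.3, np.inf], [0.7, 0.8]])
y_demo = np.array([0, 1, 1, 0])

X_clean, y_clean = drop_non_finite_rows(X_demo, y_demo)
print(X_clean.shape, y_clean.shape)
```

Because one mask indexes both arrays, the sample counts passed to `clf.fit` cannot drift apart. The same filtering would presumably be needed on the test features before calling `clf.predict`.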

gnodab