How to fix errors in scikit machine learning?

Question

I am trying to implement machine learning for a dataset with 1059 rows and 4 columns but I am getting the following error when trying to fit the model with:

knn.fit(myData['RAB'], myData['ETAPE'])

ValueError: Found input variables with inconsistent numbers of samples: [1, 1059]

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

The output of shape is:

(1059, 4)

How can I define more than one predictor variable?

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

plt.style.use('ggplot') 

myData=pd.read_csv('sabmin.csv', sep=';')
print(myData.shape)
knn = KNeighborsClassifier(n_neighbors=6) 
knn.fit(myData['RAB'], myData['ETAPE'])

sascha · Answer 1 · 2017-03-16T12:46:04.457

The shapes you are passing do not match what sklearn expects.

Here:

knn.fit(myData['RAB'], myData['ETAPE'])

it seems you are giving one series as input and one as output. That is probably not what you want, as sklearn will take the input as one sample with 1059 dimensions. sklearn's error output is compatible with my guess.

It's hard to know exactly what you are doing, but you need at least to reshape the input from (1, 1059) to (1059, 1), i.e. 1059 samples with one feature each. I would also have expected you to want to use more columns, but I don't know.
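To make the shapes concrete, here is a minimal sketch using a synthetic numpy array as a stand-in for one column of the data (the values are made up; only the length 1059 comes from the question). It shows the 1-D shape and the two reshapes the warning mentions.

```python
import numpy as np

# Synthetic stand-in for a single column with 1059 values (not the asker's data).
x = np.arange(1059, dtype=float)

# A single column is 1-D: it has no separate "samples" and "features" axes.
print(x.shape)

# One sample with 1059 features.
as_one_sample = x.reshape(1, -1)
print(as_one_sample.shape)

# 1059 samples with one feature each: the layout fit() needs here.
as_one_feature = x.reshape(-1, 1)
print(as_one_feature.shape)
```

The -1 tells numpy to infer that axis from the array's length, so the same call works for any number of rows.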

Alternatively, you could create a numpy array earlier to make it easier (myData.values; the older myData.as_matrix() was removed in pandas 1.0). I'm more of a numpy-based user with sklearn, but many people use pandas because of its name-based indexing.

The former would be something like:

knn.fit(myData['RAB'].values.reshape(-1, 1), myData['ETAPE'])
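For the second part of the question (more than one predictor), a minimal sketch follows. It assumes the other columns are numeric; RAB and ETAPE are the names from the question, while FEAT_2 and FEAT_3 are hypothetical column names and the data is randomly generated rather than read from sabmin.csv. Selecting columns with a list returns a 2-D DataFrame, so no reshape is needed.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with the same number of rows as in the question.
rng = np.random.default_rng(0)
n = 1059
my_data = pd.DataFrame({
    "RAB": rng.normal(size=n),
    "FEAT_2": rng.normal(size=n),         # hypothetical predictor column
    "FEAT_3": rng.normal(size=n),         # hypothetical predictor column
    "ETAPE": rng.integers(0, 3, size=n),  # target labels (made up)
})

# A list of column names selects a 2-D block: (n_samples, n_features).
feature_cols = ["RAB", "FEAT_2", "FEAT_3"]
X = my_data[feature_cols]
y = my_data["ETAPE"]

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

# Predict for the first five rows as a quick sanity check.
predictions = knn.predict(X.iloc[:5])
print(X.shape, y.shape, predictions.shape)
```

Note the difference between my_data["RAB"] (a 1-D Series) and my_data[["RAB"]] (a 2-D DataFrame with one column); the latter can also be passed to fit() directly.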

I really recommend reading sklearn's docs (some of the best docs around), and probably pandas' and numpy's docs as well, to understand exactly what is happening.

You may observe that sklearn's large collection of examples is mostly based on numpy inputs. This is easier for beginners, as pandas adds one more layer of complexity (DataFrames, Series, ...).
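As a rough sketch of the numpy-based route, assuming all columns are numeric and the target is the last column: convert the DataFrame once, then slice features and target by position. The column name OTHER and the data are made up for illustration; to_numpy() needs pandas 0.24 or later, and frame.values is the equivalent on older versions.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic frame; OTHER is a hypothetical second predictor.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "RAB": rng.normal(size=100),
    "OTHER": rng.normal(size=100),
    "ETAPE": rng.integers(0, 2, size=100),
})

# Convert to a plain 2-D numpy array (use frame.values on pandas < 0.24).
data = frame.to_numpy()

# Slice by position: all rows, first two columns as features, last as target.
X = data[:, :2]
y = data[:, 2].astype(int)  # mixed columns are upcast to float, so cast labels back

knn = KNeighborsClassifier(n_neighbors=6).fit(X, y)
print(knn.predict(X[:3]))
```

Slicing with data[:, :2] keeps the array 2-D, whereas data[:, 0] would give a 1-D array and bring back the original shape problem.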

How can I use the numpy-matrix with the fit model after creation? Additionally, reshape is deprecated, so I'd better use the numpy method. — tkyo, Mar 16 '17 at 13:32
reshape is not deprecated. Just one of two possible usages. It's really a good idea to understand more of those core-libs. Start with numpy and read up how to convert pandas-data to numpy-data. — sascha, Mar 16 '17 at 13:39
@IPPOKRATISKARAKOTSOGLOU to convert a pandas to numpy use [pandas.DataFrame.as_matrix](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html) — seralouk, Jul 10 '17 at 13:50
