
I am getting this error:

    ValueError: Expected n_neighbors <= n_samples,  but n_samples = 1, n_neighbors = 11

The data I am using has 14 attributes and 303 observations. I want the number of neighbors to be 11 (or anything greater than one), but this error shows up every time.

Here is my code:

    import pandas as pd
    from sklearn.model_selection import learning_curve
    from sklearn.neighbors import KNeighborsClassifier

    header_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
    dataset = pd.read_csv('E:/HCU proj doc/EHR dataset/cleveland_cleaned_data.csv', names=header_names)

    features = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal']
    target = 'num'

    training_sizes = [1, 25, 50, 75, 100, 150, 200]

    train_size, train_scores, validation_scores = learning_curve(
        estimator=KNeighborsClassifier(n_neighbors=11),
        X=dataset[features],
        y=dataset[target],
        train_sizes=training_sizes,
        cv=5,
        scoring='neg_log_loss')

Here is the traceback of the error:

    Traceback (most recent call last):
      File "E:\HCU proj doc\heart_disease_scaling_and_learning_curve.py", line 15, in <module>
        train_size, train_scores, validation_scores = learning_curve(estimator = KNeighborsClassifier(n_neighbors=11), X=dataset[features], y=dataset[target], train_sizes=training_sizes, cv=5, scoring='neg_log_loss')
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 1128, in learning_curve
        for train, test in train_test_proportions)
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 779, in __call__
        while self.dispatch_one_batch(iterator):
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 625, in dispatch_one_batch
        self._dispatch(tasks)
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 588, in _dispatch
        job = self._backend.apply_async(batch, callback=cb)
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 111, in apply_async
        result = ImmediateResult(func)
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 332, in __init__
        self.results = batch()
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 488, in _fit_and_score
        test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\model_selection\_validation.py", line 528, in _score
        score = scorer(estimator, X_test, y_test)
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\metrics\scorer.py", line 138, in __call__
        y_pred = clf.predict_proba(X)
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\neighbors\classification.py", line 190, in predict_proba
        neigh_dist, neigh_ind = self.kneighbors(X)
      File "C:\Users\Sharanya\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\neighbors\base.py", line 347, in kneighbors
        (train_size, n_neighbors)
    ValueError: Expected n_neighbors <= n_samples,  but n_samples = 1, n_neighbors = 11

What is the problem? What is going wrong in the code, and what should I do to remove the error?

Sri991
    Possible duplicate of [ValueError: Expected n_neighbors <= 1. Got 5 -Scikit K Nearest Classifier](https://stackoverflow.com/questions/29999297/valueerror-expected-n-neighbors-1-got-5-scikit-k-nearest-classifier) – desertnaut Jul 03 '18 at 12:06
  • I already checked it, but the reason and possible solution are still not clear to me. Can you please help me with this? – Sri991 Jul 03 '18 at 12:12
  • 2
    No specific ideas; for starters, I suggest to start trimming your `training_sizes` from the left (i.e. exclude `1`), or even leaving the `train_sizes` argument to its default value – desertnaut Jul 03 '18 at 12:21

2 Answers


I suspect that the problem concerns the way you are defining your target vector. Try replacing this:

    target = 'num'

with this:

    target = ['num']

Hope this helps.

Nazim Kerimbekov
  • But this time another error is showing up, and I don't see any parameter of that name in the documentation: ValueError: y_true contains only one label (0.0). Please provide the true labels explicitly through the labels argument. As you can see in the above code there is no such variable y_true, and there is no such parameter in the documentation. – Sri991 Jul 03 '18 at 13:28
  • what error are you getting and what parameters are you using? – Nazim Kerimbekov Jul 03 '18 at 13:30
  • this is the error - ValueError: y_true contains only one label (0.0). Please provide the true labels explicitly through the labels argument. – Sri991 Jul 03 '18 at 13:30
  • and I just did what you said – Sri991 Jul 03 '18 at 13:31
  • hmmm well, do you know what line is causing the problem (it should be written in the console) – Nazim Kerimbekov Jul 03 '18 at 13:36
  • At k=11 or any value greater than 1 it is still showing the same error. – Sri991 Jul 03 '18 at 13:36
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174264/discussion-between-fozoro-and-sri991). – Nazim Kerimbekov Jul 03 '18 at 13:37
  • This will just change the dimension of y from `(n_samples,)` to `(n_samples, 1)` and nothing else, so it will not solve the problem. – Vivek Kumar Jul 04 '18 at 07:18

Your task is binary classification. So when your train_sizes includes 1, only a single sample is passed to the scoring function (log_loss in this case).

That one sample's label is either 0.0 or 1.0, so y_true contains only a single label. That's the error: you need to supply all the labels to the metric function so that it can calculate the loss.
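The failure is easy to reproduce in isolation with log_loss alone (a minimal sketch; the probabilities below are made up, not from the question's model):

```python
from sklearn.metrics import log_loss

# A scoring fold containing a single sample: y_true has only one distinct label.
y_true = [0.0]
y_prob = [[0.9, 0.1]]  # made-up predicted probabilities for classes 0.0 and 1.0

# Without `labels`, log_loss cannot infer that a second class exists:
try:
    log_loss(y_true, y_prob)
except ValueError as err:
    print(err)  # y_true contains only one label (0.0). Please provide ...

# Passing the full label set explicitly resolves it:
print(log_loss(y_true, y_prob, labels=[0.0, 1.0]))  # -log(0.9), about 0.105
```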

To solve this, you can do multiple things:

1) Don't pass train_sizes to learning_curve, as @desertnaut said, and let it use the default. In that case the training data will be divided into 5 equally spaced incremental parts, which (in most cases) will contain both labels in the training set, and log_loss will automatically identify them to calculate the score.
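A quick sketch of option 1 on synthetic stand-in data (the real CSV isn't reproduced here; the 303 rows, 13 features, and binary target are assumptions mirroring the question's description):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart-disease data.
rng = np.random.RandomState(0)
X = rng.rand(303, 13)
y = rng.randint(0, 2, 303)

# With train_sizes omitted, learning_curve defaults to np.linspace(0.1, 1.0, 5):
# five evenly spaced fractions of the training fold, each large enough to
# contain both labels, so 'neg_log_loss' can be computed without error.
train_size, train_scores, validation_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=11),
    X, y, cv=5, scoring='neg_log_loss')

print(train_size)  # five increasing training-set sizes
```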

2) Change the training_sizes values to something more meaningful; for example, just remove the 1:

    training_sizes = [25, 50, 75, 100, 150, 200]

This works for me with your data.

3) Change the scoring param to pass all the labels explicitly to log_loss, so that even if you specify 1 in training_sizes, the log_loss method knows that the data has 2 labels and calculates the loss accordingly.

    from sklearn.metrics import log_loss, make_scorer

    # This will calculate the 'neg_log_loss' as you wanted, just with one extra param
    scorer = make_scorer(log_loss, greater_is_better=False,
                         needs_proba=True,
                         labels=[0.0, 1.0])   # <== This is what you need

And then, do this:

    ....
    ....
    train_size, train_scores, validation_scores = learning_curve(
        KNeighborsClassifier(n_neighbors=11),
        X=dataset[features],
        y=dataset[target],
        train_sizes=training_sizes,
        cv=5,
        scoring=scorer)  # <== Add that here
Vivek Kumar
  • Thanks Vivek, it's working. I was stuck on this for days and tried different solutions. Anyway, I am a beginner and trying to understand things by doing them. Thanks for your help. – Sri991 Jul 04 '18 at 08:12