I am trying to implement a KNN algorithm from scratch, but I am getting a strange error: "KeyError: 0".

I assume this implies I have an empty dictionary somewhere, but I don't understand how that can be. For clarity, the same data works fine with a black-box KNN implementation, so the problem must be somewhere in my code.

This is my code:

import numpy as np
import pandas as pd
import csv
import scipy.stats as stats
import math
from collections import Counter
import operator
from operator import itemgetter


"""Training features dataset"""
filenametrain_data = 'training_data.csv'
training_feature_set = pd.read_csv(filenametrain_data, header=None, usecols=range(1,13627))

"""Training labels dataset"""
filenametrain_label = 'training_labels.csv'
training_feature_label = pd.read_csv(filenametrain_label, header=None, usecols=[1], names=['Category'])

"""Split into training and testing datasets 90%/10%"""
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(training_feature_set, training_feature_label, test_size = 0.1, random_state=42)


"""KNN Model"""
def distance(X_train, y_train):
    dist = 0.0
    for i in range(len(X_train)):
        dist += pow((X_train[i] - y_train[i]), 2)
    return math.sqrt(dist)

def getNeighbors(X_train, y_train, X_test, k):
    distances = []
    for i in range(len(X_train)):
        dist = distance(X_test, X_train[i])
        distances.append((X_train[i], dist, y_train[i]))
    distances.sort(key=operator.itemgetter(1))
    neighbor = []
    for elem in range(k):
        neighbor.append((distances[elem][0], distances[elem][2]))
    return neighbor

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = int(neighbors[x][-1])
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse = True)
    return sortedVotes[0][0]

"""Prediction"""    
predictions = []
k = 4
for x in range(len(X_test)):
    neighbors = getNeighbors(X_train, y_train, y_test[x], k)
    result = getResponse(neighbors)
    predictions.append(result)   

The error returned is:

Traceback (most recent call last):
  File "", line 2, in <module>
    neighbors = getNeighbors(X_train, y_train, y_test[x], k)
  File "C:\ANACONDA\lib\site-packages\pandas\core\frame.py", line 1797, in __getitem__
    return self._getitem_column(key)
  File "C:\ANACONDA\lib\site-packages\pandas\core\frame.py", line 1804, in _getitem_column
    return self._get_item_cache(key)
  File "C:\ANACONDA\lib\site-packages\pandas\core\generic.py", line 1084, in _get_item_cache
    values = self._data.get(item)
  File "C:\ANACONDA\lib\site-packages\pandas\core\internals.py", line 2851, in get
    loc = self.items.get_loc(item)
  File "C:\ANACONDA\lib\site-packages\pandas\core\index.py", line 1572, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas\index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas\index.c:3824)
  File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:3704)
  File "pandas\hashtable.pyx", line 686, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12280)
  File "pandas\hashtable.pyx", line 694, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12231)
KeyError: 0

The datasets can be accessed here

Janne Karila
Ali

1 Answer

EDIT: You may have an extra character (for example, a byte-order mark) at the beginning of your CSV files. Try specifying the encoding in the read_csv() calls. See "encoding" in http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

encoding : str, default None Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
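To illustrate what an "extra character" at the start of a file can look like, here is a minimal sketch using only the standard library. The byte string `raw` is a made-up stand-in for file contents, not the actual data from the question, and whether the real CSV files contain a byte-order mark is an assumption.

```python
import codecs

# Hypothetical file contents: a UTF-8 byte-order mark followed by one CSV row.
raw = codecs.BOM_UTF8 + b"1,2,3\n"

plain = raw.decode("utf-8")      # the BOM survives as U+FEFF at the start
clean = raw.decode("utf-8-sig")  # the BOM is stripped while decoding

print(repr(plain[0]))  # the invisible leading character
print(repr(clean))     # the row without it
```

If the files have no such mark, both decodings give the same text and the encoding argument should make no difference.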

You're using dot notation where you don't need it (in two places that I can see right away):

operator.itemgetter(1)

You've imported itemgetter specifically:

from operator import itemgetter

So when you call itemgetter, just call it without dot notation:

itemgetter(1)
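As a quick check of what this change does, the sketch below compares the two spellings when both `import operator` and `from operator import itemgetter` are present, as they are in the question. The `pairs` list is a hypothetical stand-in for the question's `distances` list.

```python
import operator
from operator import itemgetter

# Hypothetical (name, distance) pairs standing in for the `distances` list.
pairs = [("a", 3.0), ("b", 1.0), ("c", 2.0)]

by_attribute = sorted(pairs, key=operator.itemgetter(1))  # dotted form
by_name = sorted(pairs, key=itemgetter(1))                # bare form

# Both names refer to the same object, so the sort results are identical.
same_function = operator.itemgetter is itemgetter
print(same_function, by_attribute == by_name)
```

Because both names point to the same function, dropping the dot is a style cleanup and would not by itself be expected to change the error.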
De Novo
  • Ok sure, so you're saying it should be: key=itemgetter(1) instead of key=operator.itemgetter(1)? – Ali May 02 '17 at 10:44
  • Yep. Both times. And any other time you might have used it that I missed because I am not good at reading. Any time you import a specific method, attribute, or other object from a module (e.g., from operator import itemgetter), you use that object as if it was created right there in the file you're working on. – De Novo May 02 '17 at 10:47
  • Thanks @dan-hall, I made those changes, but it is still returning the same error, so I'm guessing there must be a more serious error somewhere – Ali May 02 '17 at 10:53
  • I've just done a bit more testing and I'm wondering if it is due to the way I have loaded the data? But it looks ok to me and they all print out... Could it be a local PC thing as lots of code I am executing is returning the same error? Python 2.7 – Ali May 02 '17 at 11:07
  • Just edited it re: you may have an extra character. Try specifying encoding in your read_csv() calls, e.g.: `pd.read_csv(filenametrain_data, header=None, usecols=range(1,13627), encoding="utf-8-sig")` – De Novo May 02 '17 at 11:10
  • Thanks Dan, I just gave that a go, but again no luck and still the same "KeyError:0"... mmm... – Ali May 02 '17 at 12:55
  • I don't think this is the core issue, Dan, although you're correct in that `operator.itemgetter` may be redundant if `itemgetter`'s already in the namespace. – blacksite May 02 '17 at 17:45
  • Any more ideas from people on what this issue might be - I might try and run it on another laptop and see how that goes – Ali May 03 '17 at 02:07
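
One possibility the thread does not test, offered as an assumption rather than a confirmed diagnosis: the traceback passes through `frame.py` `__getitem__`, and the failing call passes `y_test[x]` where the third parameter of `getNeighbors` is named `X_test`. On a pandas DataFrame, `df[0]` looks up a column labelled 0, not the first row. The sketch below reproduces that behaviour on a small made-up frame; the `Category` column name comes from the question, while the values and row labels are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for y_test: one 'Category' column and shuffled row
# labels, similar to what a random train/test split leaves behind.
y_test = pd.DataFrame({"Category": [2, 0, 1]}, index=[7, 3, 9])

try:
    y_test[0]  # looks for a COLUMN labelled 0, which does not exist
    raised = False
except KeyError:
    raised = True

first_label = y_test.iloc[0]["Category"]  # positional row access
labels = y_test["Category"].values        # NumPy array, indexed from 0

print(raised, first_label, list(labels))
```

If this is the cause, indexing rows with `.iloc` or converting the frames to NumPy arrays before the loops would be one way to get 0-based positional access.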