
I have started working on a machine learning project using the K-Nearest Neighbors (KNN) method with Python's TensorFlow library. I have no experience working with TensorFlow, so I found some code on GitHub and modified it for my data.

My dataset is like this:

2,2,2,2,0,0,3
2,2,2,2,0,1,0
2,2,2,4,2,2,1
...
2,2,2,4,2,0,0
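
The first six columns of each row are the features and the last column is the class label, so a single row splits like this (just a sketch of how I read one row, not part of the actual code):

row = [2, 2, 2, 2, 0, 0, 3]
features, label = row[:6], row[6]   # six feature values, then the class label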

And this is the code which actually works fine:

import tensorflow as tf
import numpy as np

# Whole dataset => 1428 samples
dataset = 'car-eval-data-1.csv'
# samples for train, remaining for test
samples = 1300
reader = np.loadtxt(open(dataset, "rb"), delimiter=",", skiprows=1, dtype=np.int32)

train_x, train_y = reader[:samples,:5], reader[:samples,6]
test_x, test_y = reader[samples:, :5], reader[samples:, 6]

# A placeholder is assigned its value at run time; it works a bit like a variable:
#   v = tf.placeholder("variable type", [None, 4])  -- values can be multidimensional
training_values = tf.placeholder("float",[None,len(train_x[0])])
test_values     = tf.placeholder("float",[len(train_x[0])])

# Distance metric: sum of squared differences (squared Euclidean; the tf.abs is redundant since squares are non-negative)
distance = tf.abs(tf.reduce_sum(tf.square(tf.subtract(training_values,test_values)),reduction_indices=1))

prediction = tf.arg_min(distance, 0)
init = tf.global_variables_initializer()

accuracy = 0.0

with tf.Session() as sess:
    sess.run(init)
    # Looping through the test set to compare against the training set
    for i in range(len(test_x)):
        # Run the graph to find the index of the training sample closest to this test sample
        index_in_trainingset = sess.run(prediction, feed_dict={training_values: train_x, test_values: test_x[i]})

        print("Test %d, and the prediction is %s, the real value is %s"%(i,train_y[index_in_trainingset],test_y[i]))
        # If the prediction is correct, increase the accuracy
        if train_y[index_in_trainingset] == test_y[i]:
            accuracy += 1. / len(test_x)

print('Accuracy -> ', accuracy * 100, ' %')

The only thing I do not understand is this: if it is the KNN method, there has to be some K parameter that defines the number of neighbors used to predict the label for each test sample.
How can we assign the K parameter to tune the number of nearest neighbors in this code?
Is there any way to modify the code to make use of a K parameter?
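
For comparison, this is the kind of K parameter I mean; in scikit-learn, for example, the number of neighbours is an explicit argument (this is only a sketch of the behaviour I am after, assuming scikit-learn were used instead of TensorFlow):

from sklearn.neighbors import KNeighborsClassifier

# k = number of nearest neighbours that vote on the label of each test sample
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_x, train_y)
print(knn.predict(test_x))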

Masoud Masoumi Moghadam

1 Answer


You're right that the example above does not have a provision for selecting the K nearest neighbours. In the code below, I have added such a parameter (`knn_size`), along with a few other corrections.

import tensorflow as tf
import numpy as np

# Whole dataset => 1428 samples
dataset = 'PATH_TO_DATASET_CSV'
knn_size = 1
# samples for train, remaining for test
samples = 1300
reader = np.loadtxt(open(dataset, "rb"), delimiter=",", skiprows=1, dtype=np.int32)

train_x, train_y = reader[:samples,:6], reader[:samples,6]
test_x, test_y = reader[samples:, :6], reader[samples:, 6]

# A placeholder is assigned its value at run time; it works a bit like a variable:
#   v = tf.placeholder("variable type", [None, 4])  -- values can be multidimensional
training_values = tf.placeholder("float",[None, len(train_x[0])])
test_values     = tf.placeholder("float",[len(train_x[0])])


# Distance metric: sum of squared differences (squared Euclidean; the tf.abs is redundant since squares are non-negative)
distance = tf.abs(tf.reduce_sum(tf.square(tf.subtract(training_values,test_values)),reduction_indices=1))

# Multiply the distances by -1 so that the nearest neighbours (smallest distances) become the largest values, which tf.nn.top_k can then select
# tf.nn.top_k returns the top k values and their indices, here k is controlled by the parameter knn_size 
k_nearest_neighbour_values, k_nearest_neighbour_indices = tf.nn.top_k(tf.scalar_mul(-1,distance),k=knn_size)

#Based on the indices we obtain from the previous step, we locate the exact class label set of the k closest matches in the training data
best_training_labels = tf.gather(train_y,k_nearest_neighbour_indices)

if knn_size==1:
    prediction = tf.squeeze(best_training_labels)
else:
    # Now we make our prediction based on the class label that appears most frequently
    # tf.unique_with_counts() gives us all unique values that appear in a 1-D tensor along with their indices and counts 
    values, indices, counts = tf.unique_with_counts(best_training_labels)
    # This gives us the index of the class label that has repeated the most
    max_count_index = tf.argmax(counts,0)
    #Retrieve the required class label
    prediction = tf.gather(values,max_count_index)




init = tf.global_variables_initializer()

accuracy = 0.0

with tf.Session() as sess:
    sess.run(init)
    # Looping through the test set to compare against the training set
    for i in range(len(test_x)):
        # Run the graph to get the k-NN prediction for this test sample
        prediction_value = sess.run([prediction], feed_dict={training_values: train_x, test_values: test_x[i]})

        print("Test %d, and the prediction is %s, the real value is %s"%(i,prediction_value[0],test_y[i]))
        # If the prediction is correct, increase the accuracy
        if prediction_value[0] == test_y[i]:
            accuracy += 1. / len(test_x)

print('Accuracy -> ', accuracy * 100, ' %')
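
If you want to sanity-check what the graph is doing, the same neighbour selection and majority vote can be written in plain NumPy (this is only a reference sketch, assuming the same train_x, train_y and test_x arrays as above; it is not part of the TensorFlow code):

import numpy as np

def knn_predict(train_x, train_y, x, k):
    # Squared Euclidean distance from x to every training sample
    distances = np.sum((train_x - x) ** 2, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
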
Savvy
  • It works, but there is an error: `print("Test %d, and the prediction is %s, the real value is %s"%(i,train_y[index_in_trainingset],test_y[i]))` `IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices`. Do you know why that happens? – Masoud Masoumi Moghadam Aug 21 '17 at 07:30
  • The command `sess.run()` returns -1. I guess that's the problem. Do you know why that happens? – Masoud Masoumi Moghadam Aug 21 '17 at 07:52
  • @MasoudMasoumiMoghadam - Yeah, the values will be negative because we are multiplying by -1 to be able to apply tf.nn.top_k(). So I guess I should have taken the absolute value for prediction. I have made the changes accordingly. – Savvy Aug 21 '17 at 08:20
  • @MasoudMasoumiMoghadam - While the above edit will give you positive values for sess.run(prediction), I believe that is not the source of your problem. For your version of the code, sess.run(prediction) returns the index of the distance with minimum value, however for the code I have provided, you actually get the predicted value and not the index. So train_y[index_in_trainingset] looks like the source of your problems. Instead just print "index_in_trainingset" and your if condition should be - " if index_in_trainingset == test_y[i]: " . I think it is better to rename "index_in_trainingset" – Savvy Aug 21 '17 at 08:38
  • I made the changes and the code is working, but I don't know why it keeps predicting the same number. even when I change the value of `K`. I was getting accuracy of 70 % before I use the `K` parameter and now it's less than 20 %. What is your suggestion, pro? – Masoud Masoumi Moghadam Aug 21 '17 at 09:16
  • Give me a link to the dataset and I'll try to see what's going on later in my day. Also, this is the entire code, right? – Savvy Aug 21 '17 at 10:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/152459/discussion-between-masoud-masoumi-moghadam-and-savvy). – Masoud Masoumi Moghadam Aug 21 '17 at 16:55