
My training data: 3 features (fixed).

My test data: it changes every time (2 features or 1 feature); in the example code below it has 2 features.

I want to classify test data that has a different number of features than the training data, i.e. a different dimensionality. How can I achieve this? Here is my code:

import numpy as np
from sklearn import neighbors

def classify(a):
    xtrain = np.loadtxt(open("el.csv", "rb"), delimiter=",", usecols=(0, 1, 2), skiprows=1)
    print(xtrain)
    # [[ -56.  -82. -110.]
    #  [-110. -110. -110.]
    #  [ -58. -110.  -79.]
    #  [ -56. -110. -110.]
    #  [ -57.  -83. -110.]
    #  [ -63. -110. -110.]
    #  [-110. -110. -110.]]

    ytrain = np.loadtxt(open("el.csv", "rb"), delimiter=",", usecols=(3,), dtype=int, skiprows=1)
    print(ytrain)
    # [1 1 2 2 3 3 4]

    xtest = np.asarray(a)
    xtest = xtest.reshape([1, -1])
    print(xtest)
    # [['-83' '-56']]

    knn = neighbors.KNeighborsClassifier(n_neighbors=7, weights='distance')
    knn.fit(xtrain, ytrain)

    results = knn.predict(xtest)
    print(results)

And the error is:

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 2 while Y.shape[1] == 3

FBillyan

2 Answers


To begin with, let us generate some training data:

import numpy as np
xtrain = np.asarray([[ -56.,  -82., -110.],
                     [-110., -110., -110.],
                     [ -58., -110.,  -79.],
                     [ -56., -110., -110.],
                     [ -57.,  -83., -110.],
                     [ -63., -110., -110.],
                     [-110., -110., -110.]], dtype='float')
ytrain = np.asarray([1, 1, 2, 2, 3, 3, 4], dtype='int')

Now create a dictionary knns keyed by an integer: the value for key n is a k-nearest-neighbour classifier trained using only the first n features of your training data.

from sklearn.neighbors import KNeighborsClassifier
knns = {}
for n_feats in range(1, xtrain.shape[-1] + 1):
    knns[n_feats] = KNeighborsClassifier(n_neighbors=7, weights='distance')
    knns[n_feats].fit(xtrain[:, :n_feats], ytrain)

The classify function should take two parameters, namely the test data and the dictionary of classifiers. This way you ensure the classification is performed by a classifier that was trained on exactly the same number of features as the test data (discarding the rest):

def classify(test_data, classifiers):
    """Classify test_data using classifiers[n], which is the classifier
    trained with the first n features of test_data
    """
    X = np.asarray(test_data, dtype='float')
    n_feats = X.shape[-1]
    return classifiers[n_feats].predict(X)

Demo (notice that the test data has to be numeric rather than strings):

In [107]: xtest1 = [[-83, -56]]

In [108]: classify(xtest1, knns)
Out[108]: array([3])

In [109]: xtest2 = [[ -52],
     ...:           [-108],
     ...:           [ -71]]
     ...: 

In [110]: classify(xtest2, knns)
Out[110]: array([2, 1, 3])

In [111]: xtest3 = [[-122,  -87,  -94],
     ...:           [-136,  -99, -107]]
     ...: 

In [112]: classify(xtest3, knns)
Out[112]: array([1, 1])
Tonechas

Currently, sklearn models do not handle missing values in the test set. You can maintain several models (each trained on a different feature subset) and use the appropriate one for each kind of test data you want to predict. Another option is to fill in the missing values for instances that do not have all the features.
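As a sketch of the second option (filling missing values), you could pad short test samples with NaN and use sklearn's SimpleImputer (available in sklearn >= 0.20) to replace each NaN with the per-column training mean. The mean strategy and the NaN padding are my assumptions here, not part of the original answer:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

# Training data from the question: 7 samples, 3 features.
xtrain = np.asarray([[ -56.,  -82., -110.],
                     [-110., -110., -110.],
                     [ -58., -110.,  -79.],
                     [ -56., -110., -110.],
                     [ -57.,  -83., -110.],
                     [ -63., -110., -110.],
                     [-110., -110., -110.]])
ytrain = np.asarray([1, 1, 2, 2, 3, 3, 4])

# Fit the imputer on the training data so that missing test
# features are replaced with the corresponding column mean.
imputer = SimpleImputer(strategy='mean')
imputer.fit(xtrain)

knn = KNeighborsClassifier(n_neighbors=7, weights='distance')
knn.fit(xtrain, ytrain)

# A test sample with only 2 known features: pad the missing
# third feature with NaN, then impute before predicting.
xtest = np.array([[-83., -56., np.nan]])
xtest_filled = imputer.transform(xtest)
pred = knn.predict(xtest_filled)
print(xtest_filled)
print(pred)
```

Whether mean imputation is appropriate depends on your data; for signal-strength-like features a sentinel value (e.g. -110, which seems to mark "no reading" in your training set) might be a more faithful fill.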

AndreyF