I don't know why, but I'm getting the error below. `get_dummies` seems to be dropping one column for no apparent reason. I want both the train and test data to have the same number of columns.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    data = pd.read_csv('data/trainData.csv')
    train, test = train_test_split(data, test_size=0.20)
    # One-hot encode the categorical columns of train and test separately (after the split)
    train = pd.get_dummies(train, columns=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome'], drop_first=True)
    c = DecisionTreeClassifier(min_samples_split=550)
    test = pd.get_dummies(test, columns=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome'], drop_first=True)
    # Build the feature matrices from every column except the one at position 9
    train1 = train.iloc[:, 0:9]
    train2 = train.iloc[:, 10:]
    X_train = pd.concat([train1, train2], axis=1)
    test1 = test.iloc[:, 0:9]
    test2 = test.iloc[:, 10:]
    X_test = pd.concat([test1, test2], axis=1)
    y_train = train["Class"]
    dt = c.fit(X_train, y_train)
    y_true = test["Class"]
    y_true = y_true.values
    y_scores = c.predict(X_test)

The error I'm getting is below:

    ValueError                                Traceback (most recent call last)
    <ipython-input-20-9cc441bd0222> in <module>()
         13 y_true = test["Class"]
         14 y_true = y_true.values
    ---> 15 y_scores = c.predict(X_test)

    /home/ram98/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in predict(self, X, check_input)
        410         """
        411         check_is_fitted(self, 'tree_')
    --> 412         X = self._validate_X_predict(X, check_input)
        413         proba = self.tree_.predict(X)
        414         n_samples = X.shape[0]

    /home/ram98/anaconda3/lib/python3.6/site-packages/sklearn/tree/tree.py in _validate_X_predict(self, X, check_input)
        382                              "match the input. Model n_features is %s and "
        383                              "input n_features is %s "
    --> 384                              % (self.n_features_, n_features))
        385 
        386         return X

    ValueError: Number of features of the model must match the input. Model n_features is 52 and input n_features is 51
RAM
  • I suspect some levels of the categorical variables in the train data are not present in the test data, so `get_dummies` gives a different number of columns. One solution would be to call `get_dummies` before the train/test split. Let me know if that works. – ags29 Nov 01 '17 at 19:12
  • @ags29, thanks, it worked like a charm :) But I didn't understand why this happened. Can you please explain in layman's terms? – RAM Nov 01 '17 at 19:21
  • Here is an example: if you have a column with categories 1, 2, 3 in your dataset, when you split into train/test you could end up with all the 1's and 2's in train and all the 3's in test. Then when you run `get_dummies`, the result will have 2 columns on train and only 1 on test. However, if you run `get_dummies` on the full dataset you get 3 columns, and when you split into train/test after that, your splits will be aligned. Does that make sense? – ags29 Nov 01 '17 at 19:31
  • @ags29 Thanks, I got it now. Your explanation is fantastic!! – RAM Nov 03 '17 at 11:27
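
For anyone landing here later, the explanation in the comments can be reproduced with a tiny made-up example. This is only a sketch: the toy frame, the `cat` column, and the hand-picked split are assumptions for illustration, not the asker's data. It shows the mismatch, the fix suggested above (encode before splitting), and one possible alternative (reindexing the test columns to match train).

```python
import pandas as pd

# Toy data (hypothetical): one categorical column with three levels.
df = pd.DataFrame({
    "cat": ["1", "2", "1", "2", "3", "3"],
    "Class": [0, 1, 0, 1, 0, 1],
})

# Hand-picked split so that level "3" appears only in the test part.
train_raw, test_raw = df.iloc[:4], df.iloc[4:]

# Encoding after the split: each part only sees its own levels,
# so the two frames end up with different dummy columns.
train_bad = pd.get_dummies(train_raw, columns=["cat"])
test_bad = pd.get_dummies(test_raw, columns=["cat"])
print(list(train_bad.columns))
print(list(test_bad.columns))

# Fix suggested in the comments: encode the full dataset, then split.
encoded = pd.get_dummies(df, columns=["cat"])
train_ok, test_ok = encoded.iloc[:4], encoded.iloc[4:]

# Alternative: keep the separate encoding but force the test frame onto
# the train columns, filling dummies that are missing in test with 0.
test_aligned = test_bad.reindex(columns=train_bad.columns, fill_value=0)
```

With a random `train_test_split` the mismatch only shows up when a rare level happens to land entirely on one side, which is why the error can appear on some runs and not others.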