
Dataset: columns 0-9 are float features (parameters of a product); column 10 is the int label (product).

Goal

  1. Calculate a 0-1 classification certainty score for the labels (this is what my current code should do)

  2. Calculate the same certainty score for each “product_name” (300 columns) at each row (22'000 rows)

ERROR: I use sklearn.tree.DecisionTreeClassifier. I am trying to use predict_proba, but it gives an error.

Python CODE

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

data_train = pd.read_csv('data.csv')
features = data_train.columns[:-1]
labels = data_train.columns[-1]
x_features = data_train[features]
x_label = data_train[labels]
X_train, X_test, y_train, y_test = train_test_split(x_features, x_label, random_state=0)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
class_probabilitiesDec = clf.predict_proba(y_train)
# ERROR: ValueError: Number of features of the model must match the input.
# Model n_features is 10 and input n_features is 16722


print('Decision Tree Classification Accuracy Training Score (max_depth=3): {:.2f}'.format(clf.score(X_train, y_train)*100) + ('%'))
print('Decision Tree Classification Accuracy Test Score (max_depth=3): {:.2f}'.format(clf.score(X_test, y_test)*100) + ('%'))

print(class_probabilitiesDec[:10])
# If I use X_train instead, it just prints out a bunch of 41-element vectors:
# [[ 0.00490808  0.00765327  0.01123035  0.00332751  0.00665502  0.00357707
#    0.05182597  0.03169453  0.04267532  0.02761833  0.01988187  0.01281091
#    0.02936528  0.03934781  0.02329257  0.02961484  0.0353548   0.02503951
#    0.03577073  0.04700108  0.07661592  0.04433907  0.03019715  0.02196157
#    0.0108976   0.0074869   0.0291989   0.03951418  0.01372598  0.0176358
#    0.02345895  0.0169703   0.02487314  0.01813493  0.0482489   0.01988187
#    0.03252641  0.01572249  0.01455786  0.00457533  0.00083188]
#  [....

FEATURES (COLUMNS)

(the last column is the label)

0    1  1  1  1.0  1462293561  1462293561  0  0  0.0  0.0    1
1    2  2  2  8.0  1460211580  1461091152  1  1  0.0  0.0    2
2    3  3  3  1.0  1469869039  1470560880  1  1  0.0  0.0    3
3    4  4  4  1.0  1461482675  1461482675  0  0  0.0  0.0    4
4    5  5  5  5.0  1462173043  1462386863  1  1  0.0  0.0    5

CLASSES COLUMNS (300 COLUMNS OF ITEMS)

Header row: product names; following rows: scores for each row of the dataset.

apple   gameboy  battery  ....
0.763   0.346    0.345    ....
0.256   0.732    0.935    ....

Example: similar scores are produced when someone image-classifies cat vs. dog and the classifier gives confidence scores.

sogu
  • What do you call 0-1 certainty score? – Dr. Snoopy May 29 '19 at 12:03
  • Real numbers between 0 and 1 ex.: 0.753 or 0.001 – sogu May 29 '19 at 12:22
  • predict_proba does that, what is the issue? – Dr. Snoopy May 29 '19 at 12:31
  • As I have mentioned, I am using predict_proba. I am not sure what to use, but I have clear goals: calculate a certainty score for each “product_name” (300 columns) at each row (22'000). – sogu May 29 '19 at 12:56
  • Sorry but I still don't get what is the *actual* problem, you need to be clear or else nobody will be able to answer your question. If you have errors you have to include them in your question. – Dr. Snoopy May 29 '19 at 13:03
  • I have edited the post at the bottom you find "FEATURES (COLUMNS)" what I have and the "CLASSES COLUMNS (300 COLUMNS OF ITEMS)" what I want to get. – sogu May 29 '19 at 13:32
  • I need all the rows * every single type of class = 22'000 rows (all features) * 300 classes = 6'600'000 probability scores – sogu May 29 '19 at 17:13

1 Answer


You cannot predict the probability of your labels.

predict_proba predicts the probability for each label from your X data, thus:

class_probabilitiesDec = clf.predict_proba(X_test) 

What you posted as "when I use X_train":

[[ 0.00490808  0.00765327  0.01123035  0.00332751  0.00665502  0.00357707
   0.05182597  0.03169453  0.04267532  0.02761833  0.01988187  0.01281091
   0.02936528  0.03934781  0.02329257  0.02961484  0.0353548   0.02503951
   0.03577073  0.04700108  0.07661592  0.04433907  0.03019715  0.02196157
   0.0108976   0.0074869   0.0291989   0.03951418  0.01372598  0.0176358
   0.02345895  0.0169703   0.02487314  0.01813493  0.0482489   0.01988187
   0.03252641  0.01572249  0.01455786  0.00457533  0.00083188]

is a list of the predicted probabilities, one value for every possible label.
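As a quick sanity check, here is a minimal sketch (reusing clf and X_test from your code, nothing else assumed) of what each dimension of that output means:

import numpy as np

proba = clf.predict_proba(X_test)   # needs the feature matrix, not y_train
print(proba.shape)                  # (number of test rows, number of classes)
print(clf.classes_)                 # column i of proba corresponds to label clf.classes_[i]
print(np.allclose(proba.sum(axis=1), 1.0))  # each row of probabilities sums to 1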

EDIT

After reading your comments, predict_proba is exactly what you want.

Let's make an example. In the following code we have a classifier with 3 classes: either 11, 12 or 13.

If the input is 1 the classifier should predict 11

If the input is 2 the classifier should predict 12

...

If the input is 7 the classifier should predict 13

clf = DecisionTreeClassifier()
# inputs 1..7 are mapped to the target labels 11, 12 or 13
clf.fit([[1],[2],[3],[4],[5],[6],[7]], [[11],[12],[13],[13],[12],[11],[13]])

Now if you have test data with a single row, e.g. 5, then the classifier should predict 12. So let's try that.

clf.predict([[5]])

And voila: the result is array([12])

If we want a probability, then predict_proba is the way to go:

clf.predict_proba([[5]])

and we get [array([0., 1., 0.])]

In that case the array [0., 1., 0.] means :

0% probability for class 11

100% probability for class 12

0% probability for class 13

If I'm correct, that's exactly what you want. You can even map that to the names of your classes with:

probabilities = clf.predict_proba([[5]])[0]
{clf.classes_[i] : probabilities[i] for i in range(len(probabilities))}

which gives you a dictionary with probabilities for class names:

{11: 0.0, 12: 1.0, 13: 0.0}

Now in your case you have way more classes than only [11, 12, 13], so the array gets longer. And for every row in your dataset predict_proba creates an array, so for more than a single row of data your output becomes a matrix.
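For your dataset that looks roughly like the following sketch (it reuses x_features, scaler and clf from your question; wrapping the result in a DataFrame is just one way to attach the class labels as column headers):

import pandas as pd

# Score every row of the original data: one row per sample, one column per class.
all_features_scaled = scaler.transform(x_features)           # same scaling as the training data
all_probabilities = clf.predict_proba(all_features_scaled)   # shape: (n_rows, n_classes)

# Label each column with the class it belongs to (clf.classes_ keeps the same order).
proba_table = pd.DataFrame(all_probabilities, columns=clf.classes_)
print(proba_table.shape)   # n_rows x n_classes, i.e. the 22'000 x 300 table you described
print(proba_table.head())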

Florian H
  • Then what about calculating a 0-1 certainty score for each “product_name” (300 columns) at each row (22'000)? – sogu May 29 '19 at 11:24
  • try clf.predict(X_test) – Florian H May 29 '19 at 11:25
  • It gives me back a large array, array([21, 7, 21, 21, 7,..., which does not look similar to the labels nor to the features. – sogu May 29 '19 at 11:29
  • 21 is the result for your first data row, 7 is the result for your second data row. So in your first row of data 21 is True and the rest of your possible labels are False. Hope it helps, otherwise try to post an example with small dataset, input and expected output – Florian H May 29 '19 at 11:34
  • What I need is to calculate a 0-1 certainty score for each “product_name”(300 columns) at each rows(22'000) – sogu May 29 '19 at 11:37
  • Added some examples. Thank you. – sogu May 29 '19 at 11:50
  • clf.predict_proba([[5]]) gives me a strange large vector: array([[ 0.00497512, 0. , 0.00497512, 0.00497512, 0. , 0. , 0.01492537, 0. , 0. , 0. , 0. , 0. , 0. , 0.08955224, 0. , 0. , 0.07960199, 0. , 0. , 0.00497512, 0.00497512, 0.01492537, 0. , 0. , 0.04975124, 0.00497512, 0.02487562, 0.00995025, 0.00995025, 0. , 0. , 0.01492537, ..... so it looks like exactly what I need. – sogu May 29 '19 at 15:24
  • {11: 0.0, 12: 1.0, 13: 0.0} this part is not what I want. I need all the rows * every single type of class = 22'000 rows (all features) * 300 classes = 6'600'000 probability scores. By the way, I am already blown away by the help, so no worries. #like – sogu May 29 '19 at 17:03
  • If you call predict_proba with all your rows you get exactly that: a matrix with 300 columns (probability for every class) and 22000 rows (300 probabilities for each of the 22000 rows). – Florian H Jun 03 '19 at 07:21