Dataset
Columns 0-9: float features (parameters of a product). Column 10: int labels (products).
Goal
Calculate a 0-1 classification certainty score for the labels (this is what my current code should do).
Calculate the same certainty score for each "product_name" (300 columns) in each row (22,000 rows).
ERROR
I use sklearn.tree.DecisionTreeClassifier. I am trying to use "predict_proba", but it raises an error.
Python CODE
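import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

data_train = pd.read_csv('data.csv')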
features = data_train.columns[:-1]
labels = data_train.columns[-1]
x_features = data_train[features]
x_label = data_train[labels]
X_train, X_test, y_train, y_test = train_test_split(x_features, x_label, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
class_probabilitiesDec = clf.predict_proba(y_train)
# ERROR: ValueError: Number of features of the model must match the input. Model n_features is 10 and input n_features is 16722
print('Decision Tree Classification Accuracy Training Score (max_depth=3): {:.2f}'.format(clf.score(X_train, y_train)*100) + ('%'))
print('Decision Tree Classification Accuracy Test Score (max_depth=3): {:.2f}'.format(clf.score(X_test, y_test)*100) + ('%'))
print(class_probabilitiesDec[:10])
# If I use X_train instead, it just prints out a bunch of 41-element vectors:
# [[ 0.00490808 0.00765327 0.01123035 0.00332751 0.00665502 0.00357707
#    0.05182597 0.03169453 0.04267532 0.02761833 0.01988187 0.01281091
#    0.02936528 0.03934781 0.02329257 0.02961484 0.0353548 0.02503951
#    0.03577073 0.04700108 0.07661592 0.04433907 0.03019715 0.02196157
#    0.0108976 0.0074869 0.0291989 0.03951418 0.01372598 0.0176358
#    0.02345895 0.0169703 0.02487314 0.01813493 0.0482489 0.01988187
#    0.03252641 0.01572249 0.01455786 0.00457533 0.00083188]
#  [....
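For reference, here is a minimal sketch of the call as I understand it should look. It rests on some assumptions: the data is a synthetic stand-in for data.csv, and the names X_demo, y_demo, clf_demo and proba_demo are made up for illustration. As far as I can tell, predict_proba expects a feature matrix with the same 10 columns that were passed to fit, not the labels. The 16722 in the error message is presumably the number of training rows, because the label column was treated as a single row of features. The output has one column per class listed in clf.classes_, which would explain the 41-element vectors if 41 distinct labels occur in the training split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for data.csv (assumption): 200 rows, 10 float features, 4 integer labels.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 10))
y_demo = rng.integers(0, 4, size=200)

clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Pass the feature matrix (same 10 columns as in fit), not the label vector.
proba_demo = clf_demo.predict_proba(X_demo)

print(proba_demo.shape)    # (number of rows, number of classes seen during fit)
print(clf_demo.classes_)   # column j of proba_demo belongs to classes_[j]
```

Each row of proba_demo sums to 1, so every entry is already a score between 0 and 1.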
FEATURES (COLUMNS)
(the last column is the label)
0 1 1 1 1.0 1462293561 1462293561 0 0 0.0 0.0 1
1 2 2 2 8.0 1460211580 1461091152 1 1 0.0 0.0 2
2 3 3 3 1.0 1469869039 1470560880 1 1 0.0 0.0 3
3 4 4 4 1.0 1461482675 1461482675 0 0 0.0 0.0 4
4 5 5 5 5.0 1462173043 1462386863 1 1 0.0 0.0 5
CLASSES COLUMNS (300 COLUMNS OF ITEMS)
HEADER ROW:        apple  gameboy  battery  ....
SCORE in 1st row:  0.763  0.346    0.345    ....
SCORE in 2nd row:  0.256  0.732    0.935    ....
For example, this is similar to the confidence scores an image classifier gives when it classifies cat vs. dog.
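The table I have in mind could be sketched roughly as below. This is only an illustration under assumptions: the data is synthetic, the three product names are taken from the header row above, and the mapping from integer label to product name (product_names) is hypothetical. The idea is to label each predict_proba column with the class it belongs to via clf.classes_, and to take the largest value in each row as the certainty of the predicted label.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical label-to-name mapping and synthetic data (assumptions, not the real dataset).
product_names = np.array(["apple", "gameboy", "battery"])
rng = np.random.default_rng(0)
X_demo = rng.random((150, 10))
y_demo = rng.integers(0, 3, size=150)   # integer labels index into product_names

clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# One row per sample, one column per class, in the order given by clf_demo.classes_.
scores = pd.DataFrame(
    clf_demo.predict_proba(X_demo),
    columns=product_names[clf_demo.classes_],
)

# Certainty of the predicted label: the largest class probability in each row.
certainty = scores.max(axis=1)
predicted = scores.idxmax(axis=1)

print(scores.head().round(3))
print(pd.DataFrame({"predicted": predicted, "certainty": certainty}).head())
```

Note that a tree with max_depth=3 has at most 8 leaves, so at most 8 distinct score rows can appear, and the scores in one row sum to 1 across all products rather than being independent per product.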