
The decision tree classifier gives an accuracy of 0.52, but I want to increase it. How can I increase the accuracy using any of the classification models available in sklearn?

I have used kNN, a decision tree, and cross-validation, but all of them give low accuracy.

Thanks

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# read the csv file into a pandas DataFrame
wine = pd.read_csv('wine.csv')

# print the column names
original_headers = list(wine.columns.values)
print(original_headers)

# print the first three rows
print(wine[0:3])

# "Position (pos)" is the class attribute we are predicting. 
class_column = 'quality'

# All of the physicochemical attributes are used as features.
feature_columns = ['fixed acidity', 'volatile acidity', 'citric acid',
                   'residual sugar', 'chlorides', 'free sulfur dioxide',
                   'total sulfur dioxide', 'density', 'pH', 'sulphates',
                   'alcohol']

# Use pandas column selection to split the data into features and class.
wine_feature = wine[feature_columns]
wine_class = wine[class_column]

print(wine_feature[0:3])
print(list(wine_class[0:3]))

train_feature, test_feature, train_class, test_class = \
    train_test_split(wine_feature, wine_class, stratify=wine_class,
                     train_size=0.75, test_size=0.25)

# k-nearest neighbors with Manhattan distance (Minkowski metric with p=1)
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn.fit(train_feature, train_class)
prediction = knn.predict(test_feature)
print("Test set predictions:\n{}".format(prediction))
print("Test set accuracy: {:.2f}".format(knn.score(test_feature, test_class)))

train_class_df = pd.DataFrame(train_class, columns=[class_column])
train_data_df = pd.merge(train_class_df, train_feature, left_index=True, right_index=True)
train_data_df.to_csv('train_data.csv', index=False)

temp_df = pd.DataFrame(test_class, columns=[class_column])
temp_df['Predicted quality'] = pd.Series(prediction, index=temp_df.index)
test_data_df = pd.merge(temp_df, test_feature, left_index=True, right_index=True)
test_data_df.to_csv('test_data.csv', index=False)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(train_feature, train_class)
print("Training set score: {:.3f}".format(tree.score(train_feature, train_class)))
print("Test set score Decision: {:.3f}".format(tree.score(test_feature, test_class)))

prediction = tree.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
# a fresh, unfitted tree for cross-validation
tree = DecisionTreeClassifier(max_depth=4, random_state=0)

scores = cross_val_score(tree, train_feature,train_class, cv=10)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))

2 Answers


Usually the next step after a decision tree is a random forest (and its relatives) or XGBoost (though that is not part of sklearn). Try them. Also keep in mind that decision trees are very easy to overfit.
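
For example, a minimal sketch of swapping in a random forest, reusing the train/test variables from the question (n_estimators=200 is an illustrative starting point, not a tuned value):

from sklearn.ensemble import RandomForestClassifier

# an ensemble of trees typically overfits less than one deep tree
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(train_feature, train_class)
print("Random forest test set score: {:.3f}".format(forest.score(test_feature, test_class)))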

Remove outliers. Check the class distribution in your dataset: if the classes are unbalanced, most of the errors may come from there. In that case you need to use class weights while fitting, or a metric that accounts for imbalance (such as F1).
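
A sketch of both ideas, assuming the same variables as in the question (class_weight='balanced' reweights classes inversely to their frequencies; f1_macro averages F1 over the classes):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# check the class distribution first
print(train_class.value_counts())

# reweight classes inversely proportional to their frequencies
tree = DecisionTreeClassifier(max_depth=4, class_weight='balanced', random_state=0)

# score with macro-averaged F1 instead of plain accuracy
scores = cross_val_score(tree, train_feature, train_class, cv=10, scoring='f1_macro')
print("Average macro-F1: {:.2f}".format(scores.mean()))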

It would also help to attach your confusion matrix here.

A neural network (even sklearn's MLPClassifier) may also give better results.
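
A minimal sketch, assuming the same train/test split as in the question (the hidden layer size is an illustrative guess; scaling the inputs first matters a lot for neural networks):

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# neural networks need standardized inputs to converge well
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0))
mlp.fit(train_feature, train_class)
print("MLP test set score: {:.3f}".format(mlp.score(test_feature, test_class)))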


Improve your preprocessing.

Methods such as decision trees and kNN can be sensitive to how you preprocess your columns. For example, a decision tree can benefit a lot from well-chosen thresholds on the continuous variables, and kNN is distance-based, so the scales of the features matter directly.
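
As a sketch, standardizing the features before kNN, reusing the split from the question:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# put all features on the same scale so no single column dominates the distance
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(train_feature, train_class)
print("Scaled kNN test set score: {:.3f}".format(scaled_knn.score(test_feature, test_class)))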
