
I need to create a classification model to predict the class of a certain event: 1, 2, or 3. I have tried two models so far: a multiclass decision tree and a multiclass neural network. Below are the accuracy scores and confusion matrices for each.

multiclass decision tree:

    Overall accuracy           0.634
    Average accuracy           0.756
    Micro-averaged precision   0.634
    Macro-averaged precision   0.585184
    Micro-averaged recall      0.634
    Macro-averaged recall      0.548334

confusion matrix (rows = actual class, columns = predicted class):

                Pred. 1   Pred. 2   Pred. 3
    Actual 1     40.3%     53.6%      6.1%
    Actual 2      6.6%     76.6%     16.8%
    Actual 3      0.6%     51.8%     47.6%

multiclass neural network:

    Overall accuracy           0.5865
    Average accuracy           0.724333
    Micro-averaged precision   0.5865
    Macro-averaged precision   0.583795
    Micro-averaged recall      0.5865
    Macro-averaged recall      0.460215

confusion matrix (rows = actual class, columns = predicted class):

                Pred. 1   Pred. 2   Pred. 3
    Actual 1     34.8%     63.5%      1.7%
    Actual 2      2.9%     89.3%      7.7%
    Actual 3      0.1%     85.9%     13.9%

I think this means both models do well on class 2, especially the neural network. On the other two classes the decision tree does better, but its per-class accuracy is still below 50%.
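For context, here is a minimal sketch of how metrics and a row-normalized confusion matrix like the ones above can be computed with scikit-learn (`y_test` and `y_pred` are illustrative placeholders, not my actual data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# illustrative placeholders for the true and predicted class labels (1, 2 or 3)
y_test = np.array([1, 2, 2, 3, 1, 3])
y_pred = np.array([2, 2, 2, 3, 1, 2])

print("Overall accuracy:", accuracy_score(y_test, y_pred))
print("Micro-averaged precision:", precision_score(y_test, y_pred, average='micro'))
print("Macro-averaged precision:", precision_score(y_test, y_pred, average='macro'))
print("Micro-averaged recall:", recall_score(y_test, y_pred, average='micro'))
print("Macro-averaged recall:", recall_score(y_test, y_pred, average='macro'))

# row-normalized confusion matrix: each row (actual class) sums to 100%
cm = confusion_matrix(y_test, y_pred)
cm_pct = cm / cm.sum(axis=1, keepdims=True) * 100
print(np.round(cm_pct, 1))
```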

How should I improve the results based on these indicators? Thanks.

  • How many rows are there in the dataset after normalizing? – Kenyi Despean Apr 09 '18 at 02:11
  • Are you using `id` as well for the training? Try removing it if so. For more details, see https://stats.stackexchange.com/questions/224565/overfitting-due-to-a-unique-identifier-among-features – niraj Apr 09 '18 at 02:12
  • How did 40 columns turn into 70? Are you using dummy variables for y as well? – Him Apr 09 '18 at 02:14
  • 20,000 rows after normalizing. – WJ Zhao Apr 09 '18 at 02:20
  • When I read the CSV into Python, I set index_col=0, which is the id column. This column doesn't follow any order. – WJ Zhao Apr 09 '18 at 02:21
  • Regarding 40 columns turning into 70: the categorical/object columns have varying numbers of distinct values. For example, feature 10 may have the values [abc, def, opq]; after being converted with get_dummies, feature 10 turns into 3 columns instead of 1. – WJ Zhao Apr 09 '18 at 02:25
  • I didn't use get_dummies for y (the labels). – WJ Zhao Apr 09 '18 at 02:25
  • Thanks, 0p3n5ourcE, the article is very helpful. It might be a silly question: if I remove the id column from both datasets (train data and train labels), will the model still be able to recognize which label belongs to which data row? Should I sort the two datasets first? – WJ Zhao Apr 09 '18 at 02:31
  • I don't think so. During training you insert features and the corresponding labels, and the model maps them to each other, so there is no need to sort (*assuming the fit function in sklearn*). When you predict, if you send a list of instances, you get back a list of predicted values for those same instances. You can check the tutorial http://scikit-learn.org/stable/tutorial/basic/tutorial.html I generally remove the id if it is distinct/unique because it has no predictive effect. (See the sketch after these comments.) – niraj Apr 09 '18 at 03:02
  • Thanks, 0p3n5ourcE. Will definitely do that and update here on how the results improve. – WJ Zhao Apr 09 '18 at 12:48
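A minimal sketch of the workflow discussed in this comment thread (`data.csv` and the `label` column name are illustrative assumptions, not the asker's actual files):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# index_col=0 keeps the id column as the index, so it is never used as a feature
df = pd.read_csv("data.csv", index_col=0)

# one-hot encode only the categorical/object columns; the label column stays as-is
y = df["label"]
X = pd.get_dummies(df.drop(columns=["label"]))

# rows of X and y stay aligned by position/index, so no sorting is needed:
# fit() maps each feature row to its corresponding label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)  # predictions come back in the same order as X_test rows
```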

1 Answer


Remove the id feature, and also check for and remove any other features that you think add no value to prediction (other id-like features or features with unique values). Also check for class imbalance: how many samples of each class are present in the data, and is there a proper balance among the classes? Then try applying the models and tuning their parameters for better results. You may use cross-validation for more reliable estimates.
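For example, a rough sketch of these checks with scikit-learn, assuming a feature matrix `X` and a pandas Series of labels `y` (the parameter grid and the `class_weight="balanced"` choice are just illustrative starting points):

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# check class balance: per-class counts reveal whether one class dominates
print(y.value_counts())

# cross-validated accuracy gives a more reliable estimate than a single split
clf = DecisionTreeClassifier(class_weight="balanced")  # compensates for imbalance
print(cross_val_score(clf, X, y, cv=5).mean())

# simple parameter tuning with grid search
params = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(class_weight="balanced"), params, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

If class 2 heavily outnumbers the others, `class_weight="balanced"` (or resampling) helps keep the model from defaulting to predicting class 2.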
