I have a data set consisting of my rules, and I want to generate a decision tree that classifies those rules with 100% accuracy, but I can never get there. I set minNumObj to 1 and made the tree unpruned, yet I still only get 84% correctly classified instances.

My attributes are:

@attribute users numeric
@attribute bandwidth numeric
@attribute latency numeric
@attribute mode {C,H,DCF,MP,DC,IND}

Example data:

2,200000,0,C
2,200000,1000,C
2,200000,2000,MP
2,200000,5000,C
2,400000,0,C
2,400000,1000,DCF
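
For completeness, the attribute header and example rows combine into a minimal ARFF file like this (the relation name rules is just a placeholder):

    @relation rules

    @attribute users numeric
    @attribute bandwidth numeric
    @attribute latency numeric
    @attribute mode {C,H,DCF,MP,DC,IND}

    @data
    2,200000,0,C
    2,200000,1000,C
    2,200000,2000,MP
    2,200000,5000,C
    2,400000,0,C
    2,400000,1000,DCF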

Can someone help me understand why I can never get 100% of my instances classified correctly, and how I can achieve 100% (while still keeping my attributes numeric)?

Thanks

jmasterx

1 Answer

It is sometimes impossible to get 100% accuracy due to identical feature vectors having different labels. I am guessing in your case that users, bandwidth, and latency are the features, while mode is the label that you are trying to predict. If so, then there may be identical values of {users, bandwidth, latency} that happen to have different mode labels.

In general, having different labels for the same features can arise in several ways:

  1. There is noise in the data, such as a bad reading when the data was collected.
  2. There is a source of randomness that is not captured by the features.
  3. There are additional features that could distinguish between the labels, but they are not in your data set.

One thing you can do now is to run your training set back through the decision tree and find the instances that were misclassified. Try to determine why they are wrong, and check whether any of them exhibit the problem described above (namely, some instances with the same features but different labels).
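
A minimal sketch of that check with the Weka Java API might look like the following; the file name rules.arff is a placeholder, and the J48 settings mirror the ones mentioned in the question (unpruned, minNumObj = 1). It re-classifies the training set, prints every misclassified instance, and also flags any pair of instances that share the same feature values but carry different mode labels:

    import java.util.HashMap;
    import java.util.Map;
    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class DiagnoseTree {
        public static void main(String[] args) throws Exception {
            // Load the data set; "rules.arff" is a placeholder file name.
            Instances data = DataSource.read("rules.arff");
            data.setClassIndex(data.numAttributes() - 1); // mode is the last attribute

            // Unpruned J48 with minNumObj = 1, as described in the question.
            J48 tree = new J48();
            tree.setUnpruned(true);
            tree.setMinNumObj(1);
            tree.buildClassifier(data);

            // Re-classify the training set and print misclassified instances,
            // while also checking for identical feature vectors with different labels.
            Map<String, String> seen = new HashMap<>();
            for (int i = 0; i < data.numInstances(); i++) {
                Instance inst = data.instance(i);
                String actual = inst.stringValue(data.classIndex());

                double pred = tree.classifyInstance(inst);
                if (pred != inst.classValue()) {
                    System.out.println("Misclassified: " + inst + " -> predicted "
                            + data.classAttribute().value((int) pred));
                }

                String key = inst.value(0) + "," + inst.value(1) + "," + inst.value(2);
                String prev = seen.putIfAbsent(key, actual);
                if (prev != null && !prev.equals(actual)) {
                    System.out.println("Conflict: features {" + key + "} appear with labels "
                            + prev + " and " + actual);
                }
            }
        }
    }

If the conflict check prints nothing, the misclassifications come from somewhere other than duplicate feature vectors, and the misclassified instances themselves are the place to look.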

stackoverflowuser2010
  • All the feature vectors are unique. The ones it gets wrong are the outliers; for example, bandwidth might increase in steps of 100 from 100 to 1000, with every step being mode C except 600, which is mode DC. Those corner cases are the ones it gets wrong. – jmasterx Apr 21 '16 at 22:54
  • I tried a Best First tree and it managed to classify 96% of them, but the tree was ugly. – jmasterx Apr 21 '16 at 22:55
  • You may also apply feature scaling so that the numeric features are on the same scale (e.g. between 0.0 and 1.0). Two common approaches are z-score scaling (a.k.a. standardization) and min-max scaling. Wikipedia has a very clear explanation: https://en.wikipedia.org/wiki/Feature_scaling. As a matter of fact, Weka has this capability built in: http://stackoverflow.com/questions/20904071/how-to-use-different-scaling-approaches-in-weka – stackoverflowuser2010 Apr 21 '16 at 23:27
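
As a sketch of the built-in scaling the comment above mentions (reusing the same placeholder file name), Weka's unsupervised Normalize filter rescales every numeric attribute to [0, 1]:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class ScaleFeatures {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("rules.arff"); // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);

            // Normalize rescales all numeric attributes to [0.0, 1.0] by default;
            // use weka.filters.unsupervised.attribute.Standardize for z-scores instead.
            Normalize norm = new Normalize();
            norm.setInputFormat(data);
            Instances scaled = Filter.useFilter(data, norm);

            System.out.println(scaled);
        }
    }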