For an assignment, I have to measure the performance of an ID3 tree on the training data, explain why evaluating on the training data is a bad idea, and find a way to measure the performance without using the training data.

[Image: the process in RapidMiner]

With this I get a performance of 100%, which I assume is wrong. Even if it isn't, I have no idea where to go from here. Any help?

1 Answer

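Your problem is that you use the same data for training and testing. An unpruned tree such as ID3 can essentially memorize the examples it was trained on, so scoring it on those same examples says little about how it handles new data.
What you want to do is split the data into a training set and a test set. Then you train your ID3 tree on the training set, apply that tree to the test set, and calculate the performance on that result.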
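
To see why a score on the training data is misleading, here is a minimal sketch in plain Python (not RapidMiner). It assumes a small synthetic data set rather than the Sonar data, and a simplified ID3 without pruning. The names `make_data`, `build_id3`, `shape`, and `reading` are made up for this illustration. The `reading` attribute has a different nominal value in every row, similar to what you can get when raw numeric values are turned into categories.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def build_id3(rows, labels, attributes):
    """Grow an unpruned ID3-style tree over nominal attributes (column indices)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority  # leaf: pure node, or nothing left to split on

    def remainder(attr):
        # Weighted entropy after splitting on attr (lowest = highest information gain).
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]
    children = {}
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        children[value] = build_id3([r for r, _ in subset], [l for _, l in subset], rest)
    return (best, children, majority)


def predict(tree, row):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while isinstance(tree, tuple):
        attr, children, majority = tree
        if row[attr] not in children:
            return majority
        tree = children[row[attr]]
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == l for r, l in zip(rows, labels)) / len(labels)


def make_data(n, seed):
    """Hypothetical data: the label follows `shape`, with 20% of labels flipped."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        shape = rng.choice(["round", "long"])
        reading = f"r{i}"  # unique per row, so it identifies the row exactly
        label = "Mine" if shape == "round" else "Rock"
        if rng.random() < 0.2:
            label = "Rock" if label == "Mine" else "Mine"
        rows.append((shape, reading))
        labels.append(label)
    return rows, labels


rows, labels = make_data(100, seed=0)
order = list(range(len(rows)))
random.Random(1).shuffle(order)          # shuffled sampling
cut = len(order) * 7 // 10               # 70% training, 30% test
train_rows = [rows[i] for i in order[:cut]]
train_labels = [labels[i] for i in order[:cut]]
test_rows = [rows[i] for i in order[cut:]]
test_labels = [labels[i] for i in order[cut:]]

tree = build_id3(train_rows, train_labels, attributes=[0, 1])
train_acc = accuracy(tree, train_rows, train_labels)
test_acc = accuracy(tree, test_rows, test_labels)
print(f"training accuracy: {train_acc:.2f}")  # 1.00: every training row is memorized
print(f"hold-out accuracy: {test_acc:.2f}")
```

In this sketch the tree splits on `reading`, ends up with one pure leaf per training row, and therefore scores perfectly on the training set. On the held-out rows every `reading` value is new, so the tree can only fall back to the majority class. The training score looks perfect, but only the held-out score tells you how the model behaves on unseen data.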

The easiest way to do this is the Split Data operator, where you can set the ratio of the split (typically something like 0.7 for training and 0.3 for testing). A more robust approach to validating a model's performance is the Cross Validation operator, which repeats the train/test split over several folds and averages the results.
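
If it helps to see what cross validation does under the hood, here is a rough sketch in plain Python. This is not how RapidMiner implements it; `k_fold_indices` and `cross_validate` are hypothetical helper names, and the learner is a trivial majority-class stand-in rather than ID3, just to keep the example short.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and yield (train_idx, test_idx) for each of k folds.

    Assumes k <= n so that no fold is empty.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint, nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(rows, labels, fit, predict, k=10, seed=0):
    """Return the per-fold accuracies of a learner given as fit/predict callables."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k, seed):
        # Train only on the k-1 training folds ...
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        # ... and score only on the fold that was held out.
        hits = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Stand-in learner (an assumption, not ID3): always predicts the majority class.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Made-up example data: 30 rows with one nominal attribute each.
rows = [(i % 3,) for i in range(30)]
labels = ["Rock"] * 18 + ["Mine"] * 12

scores = cross_validate(rows, labels, fit_majority, predict_majority, k=5)
mean_accuracy = sum(scores) / len(scores)
print("per-fold accuracy:", scores)
print("mean accuracy:", mean_accuracy)
```

Every example is used for testing exactly once and never in the same fold it was trained on, which is why the averaged score is usually a steadier estimate than a single 0.7/0.3 split.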

Here is the process XML as well; just copy and paste it into your RapidMiner process view:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
  <operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Samples/data/Sonar"/>
  </operator>
  <operator activated="true" class="numerical_to_polynominal" compatibility="8.2.000" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="179" y="85">
    <parameter key="include_special_attributes" value="true"/>
  </operator>
  <operator activated="true" class="split_data" compatibility="8.2.000" expanded="true" height="103" name="Split Data" width="90" x="380" y="85">
    <enumeration key="partitions">
      <parameter key="ratio" value="0.7"/>
      <parameter key="ratio" value="0.3"/>
    </enumeration>
    <parameter key="sampling_type" value="shuffled sampling"/>
  </operator>
  <operator activated="true" class="id3" compatibility="8.2.000" expanded="true" height="82" name="ID3" width="90" x="581" y="85"/>
  <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="648" y="238">
    <list key="application_parameters"/>
  </operator>
  <operator activated="true" class="performance_classification" compatibility="8.2.000" expanded="true" height="82" name="Performance" width="90" x="782" y="238">
    <list key="class_weights"/>
  </operator>
  <connect from_op="Retrieve Sonar" from_port="output" to_op="Numerical to Polynominal" to_port="example set input"/>
  <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Split Data" to_port="example set"/>
  <connect from_op="Split Data" from_port="partition 1" to_op="ID3" to_port="training set"/>
  <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
  <connect from_op="ID3" from_port="model" to_op="Apply Model" to_port="model"/>
  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
  <connect from_op="Performance" from_port="performance" to_port="result 1"/>
  <portSpacing port="source_input 1" spacing="0"/>
  <portSpacing port="sink_result 1" spacing="0"/>
  <portSpacing port="sink_result 2" spacing="0"/>
  </process>
 </operator>
</process>
David