If statement based on RapidMiner clustering results

Question

After a clustering process such as k-means is run on a set of points and produces, say, 5 clusters, is it possible to write a record to a database for each cluster based on the category shared by the majority of its points?

I.e., in pseudocode:

if majority of points within cluster have attribute category == 'state'
    add record in database with attribute description == 'state'
else
    add record in database with attribute description == 'private'

I hope my explanation was clear!

It is possible, but to be clear, do you mean the following? If there are 100 examples in cluster1 and 51 of these have another attribute called `category` set to `state`, then set another attribute called `description` to `state`; otherwise set `description` to `private` for all 100 of the examples. Repeat for the other clusters, taking account of the number of examples in each cluster. Save the final result in a database. — Andrew Chisholm, May 14 '16 at 08:12
Exactly. So the final result to be saved in the database (if, for example, the majority are 'state') will be: [centroid of cluster 1] [desc = 'state'] — X'Byte, May 15 '16 at 17:07

score 0 · Answer 1 · answered May 16 '16 at 12:06

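It's a relatively complex process, but here's a worked example you can copy. It uses the Iris sample data, clusters it with k-means (k = 10), and generates a random category attribute to stand in for your own.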

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="7.0.000" expanded="true" height="82" name="Clustering" width="90" x="246" y="34">
        <parameter key="k" value="10"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="136">
        <list key="function_descriptions">
          <parameter key="category" value="if(rand()&gt;0.5, &quot;state&quot;, &quot;notstate&quot;)"/>
          <parameter key="categoryNumeric" value="if(category==&quot;state&quot;, 1, 0)"/>
        </list>
      </operator>
      <operator activated="true" class="aggregate" compatibility="7.0.000" expanded="true" height="82" name="Aggregate" width="90" x="246" y="238">
        <list key="aggregation_attributes">
          <parameter key="categoryNumeric" value="average"/>
        </list>
        <parameter key="group_by_attributes" value="cluster"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="380" y="340">
        <list key="function_descriptions">
          <parameter key="description" value="if ([average(categoryNumeric)]&gt;0.5, &quot;state&quot;,&quot;private&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="join" compatibility="7.0.000" expanded="true" height="82" name="Join" width="90" x="514" y="238">
        <parameter key="join_type" value="left"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="cluster" value="cluster"/>
        </list>
      </operator>
      <operator activated="true" class="jdbc_connectors:write_database" compatibility="7.0.000" expanded="true" height="68" name="Write Database" width="90" x="715" y="238">
        <parameter key="connection" value="LocalMYSQL"/>
        <parameter key="schema_name" value="ascom"/>
        <parameter key="table_name" value="joinresult"/>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Generate Attributes (4)" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="original" to_op="Join" to_port="left"/>
      <connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Join" from_port="join" to_op="Write Database" to_port="input"/>
      <connect from_op="Write Database" from_port="through" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

The main points are:

Create an attribute called categoryNumeric that is set to 1 if category is state and 0 otherwise.
Aggregate by cluster and take the average of categoryNumeric. If the average for a cluster is greater than 0.5, the majority of the examples in that cluster have category equal to state.
Create a new attribute in the aggregation result called description based on the majority determination.
Each cluster now has a description, which can be joined to the original data using the cluster identifier as a key.
Write the joined result to a database (I used MySQL).
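
For readers who want to check the logic outside RapidMiner, the same steps can be sketched in plain Python. This is only an illustration, not part of the process above: the sample rows, the `describe_clusters` helper, and the in-memory SQLite database (standing in for MySQL) are all assumptions made for the sketch. Only the `joinresult` table name and the attribute names come from the process.

```python
import sqlite3
from collections import defaultdict

# Hypothetical clustered examples: (cluster label, category).
examples = [
    ("cluster_0", "state"),
    ("cluster_0", "state"),
    ("cluster_0", "private"),
    ("cluster_1", "private"),
    ("cluster_1", "state"),
    ("cluster_1", "private"),
    ("cluster_2", "state"),
    ("cluster_2", "private"),
]


def describe_clusters(rows):
    """Return {cluster: description} using the average-of-indicator rule."""
    totals = defaultdict(lambda: [0, 0])  # cluster -> [state count, row count]
    for cluster, category in rows:
        totals[cluster][0] += 1 if category == "state" else 0  # categoryNumeric
        totals[cluster][1] += 1
    # An average strictly above 0.5 means a majority; a tie falls to "private".
    return {
        cluster: "state" if state_count / row_count > 0.5 else "private"
        for cluster, (state_count, row_count) in totals.items()
    }


descriptions = describe_clusters(examples)

# Join the per-cluster description back onto every example and store the result.
# SQLite in memory is an assumption here; the process above writes to MySQL.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE joinresult (cluster TEXT, category TEXT, description TEXT)"
)
connection.executemany(
    "INSERT INTO joinresult VALUES (?, ?, ?)",
    [(cluster, category, descriptions[cluster]) for cluster, category in examples],
)
connection.commit()
```

If you only want one row per cluster, as in the comment under the question, you could insert the items of `descriptions` instead of the joined rows.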

Hope this helps as a start.
