Suppose a k-means clustering process is run on a set of points and produces 5 clusters. Is it possible to write a record to a database based on the attribute value held by the majority of points within each cluster?

i.e., in pseudocode:

if the majority of points within a cluster have attribute category == 'state'
    add a record to the database with attribute description == 'state'
else
    add a record to the database with attribute description == 'private'

Hope my explanation was clear!

X'Byte
  • That is possible, but to be clear, do you mean the following? If cluster1 contains 100 examples and 51 of them have another attribute called `category` set to `state`, then set a further attribute called `description` to `state` for all 100 examples; otherwise set `description` to `private` for all of them. Repeat for the other clusters, taking each cluster's own size into account, and save the final result in a database. – Andrew Chisholm May 14 '16 at 08:12
  • Exactly. So the final result to be saved in the database (e.g. if the majority are 'state') will be: [centroid of cluster 1] [desc = 'state'] – X'Byte May 15 '16 at 17:07

1 Answer

This is a relatively complex process, but here is a worked example you can copy.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="7.0.000" expanded="true" height="82" name="Clustering" width="90" x="246" y="34">
        <parameter key="k" value="10"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="136">
        <list key="function_descriptions">
          <parameter key="category" value="if(rand()&gt;0.5, &quot;state&quot;, &quot;notstate&quot;)"/>
          <parameter key="categoryNumeric" value="if(category==&quot;state&quot;, 1, 0)"/>
        </list>
      </operator>
      <operator activated="true" class="aggregate" compatibility="7.0.000" expanded="true" height="82" name="Aggregate" width="90" x="246" y="238">
        <list key="aggregation_attributes">
          <parameter key="categoryNumeric" value="average"/>
        </list>
        <parameter key="group_by_attributes" value="cluster"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes (4)" width="90" x="380" y="340">
        <list key="function_descriptions">
          <parameter key="description" value="if ([average(categoryNumeric)]&gt;0.5, &quot;state&quot;,&quot;private&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="join" compatibility="7.0.000" expanded="true" height="82" name="Join" width="90" x="514" y="238">
        <parameter key="join_type" value="left"/>
        <parameter key="use_id_attribute_as_key" value="false"/>
        <list key="key_attributes">
          <parameter key="cluster" value="cluster"/>
        </list>
      </operator>
      <operator activated="true" class="jdbc_connectors:write_database" compatibility="7.0.000" expanded="true" height="68" name="Write Database" width="90" x="715" y="238">
        <parameter key="connection" value="LocalMYSQL"/>
        <parameter key="schema_name" value="ascom"/>
        <parameter key="table_name" value="joinresult"/>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Generate Attributes (4)" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="original" to_op="Join" to_port="left"/>
      <connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Join" to_port="right"/>
      <connect from_op="Join" from_port="join" to_op="Write Database" to_port="input"/>
      <connect from_op="Write Database" from_port="through" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

The main points are:

  • Create an attribute called categoryNumeric that is set to 1 if category is state and 0 otherwise. (In this example, category itself is generated at random, because the Iris sample data has no such attribute.)
  • Aggregate by cluster and take the average of categoryNumeric. An average greater than 0.5 means that the majority of the examples in that cluster have category equal to state. An exact tie (average of 0.5) is not a majority, so it falls through to private.
  • Create a new attribute in the aggregation result called description, set to state or private according to that majority test.
  • Each cluster now has additional data, which can be joined to the original data using the cluster identifier as the key.
  • Write the joined result to a database. I used MySQL, via a connection named LocalMYSQL and a table named joinresult.
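
For readers who want to see the logic outside RapidMiner, the steps above can be sketched in a few lines of Python. This is only an illustration of the same majority rule and not part of the process itself: the function names, the sample rows, the table name cluster_description, and the use of an in-memory SQLite database in place of MySQL are all assumptions made for the example.

```python
import sqlite3
from collections import defaultdict


def majority_descriptions(rows, target="state", fallback="private"):
    """Map each cluster to `target` if more than half of its rows have that category.

    `rows` is an iterable of (cluster, category) pairs (hypothetical input shape).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cluster, category in rows:
        totals[cluster] += 1
        # Equivalent of categoryNumeric: 1 for the target category, 0 otherwise.
        hits[cluster] += 1 if category == target else 0
    # Same rule as the process: the average of the 0/1 flag must exceed 0.5.
    return {c: (target if hits[c] / totals[c] > 0.5 else fallback) for c in totals}


def write_descriptions(conn, descriptions):
    """Store one (cluster, description) row per cluster; the table name is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cluster_description "
        "(cluster TEXT PRIMARY KEY, description TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO cluster_description VALUES (?, ?)",
        sorted(descriptions.items()),
    )
    conn.commit()


# Hypothetical sample data: cluster_0 has a 'state' majority, cluster_1 is a tie.
rows = [
    ("cluster_0", "state"),
    ("cluster_0", "state"),
    ("cluster_0", "notstate"),
    ("cluster_1", "notstate"),
    ("cluster_1", "state"),
]

descriptions = majority_descriptions(rows)
conn = sqlite3.connect(":memory:")  # stand-in for a real database connection
write_descriptions(conn, descriptions)
stored = conn.execute(
    "SELECT cluster, description FROM cluster_description ORDER BY cluster"
).fetchall()
```

Unlike the process above, which writes the joined per-example data, this sketch stores one row per cluster, which is closer to the "[centroid] [desc]" result described in the comments. Adding the centroid coordinates would require reading them from the cluster model, which is not shown here.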

Hope this helps as a start.

Andrew Chisholm