
I have a classification problem in which a set of blocks forms my data points. One attribute I can use to classify a block is its tag, which is the block number of another block. Blocks also have other attributes (such as size) that can be used for classification. The "tag" attribute can be used as follows: if the tags (block numbers) of two blocks belong to the same cluster, the two blocks should be clustered together. However, I do not know beforehand which cluster a tagged block will end up in.

Block 1 [Tag 4] size 10
Block 2 [Tag 3] size 20
Block 3 [Tag 1] size 100
Block 4 [Tag 2] size 110

Here, based on the tag attribute, Block 1 and Block 2 tag Block 4 and Block 3 respectively. Likewise, Block 3 and Block 4 tag Block 1 and Block 2 respectively. Hence Block 1 and Block 2 can belong to cluster id 1, and Block 3 and Block 4 can belong to cluster id 2. The sizes of blocks 1 and 2 are also close to each other, as are the sizes of blocks 3 and 4. The end result of the classification should be:

cluster id 1: Block 1 , Block 2
cluster id 2: Block 3 , Block 4

Is there a way to classify such data points? As I understand it, a Naive Bayes classifier assumes the attributes are independent of each other. Here, the tag attribute depends on a future event (the cluster id to which the tagged block will belong). What form or class of clustering algorithms should I look at to solve this problem? One approach I can think of is to run k-means on the other attributes, such as size, and then, once I approximately know the cluster ids, attach to each block the cluster id of its tagged block and use that as an attribute for classification. Are there better alternative approaches for writing classifiers whose attributes depend on the resulting clusters themselves? Any help would be appreciated.
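The two-stage approach described above can be sketched on the four example blocks. This is only an illustration under stated assumptions: the helper names `kmeans_1d` and `assign`, the choice of k = 2, and the evenly spaced initial centers are all hypothetical and not part of the original problem.

```python
# Toy data from the example: block number -> (tag, size).
blocks = {1: (4, 10), 2: (3, 20), 3: (1, 100), 4: (2, 110)}


def assign(value, centers):
    """Return the index of the center closest to value."""
    return min(range(len(centers)), key=lambda c: abs(value - centers[c]))


def kmeans_1d(values, k=2, iters=20):
    """Plain Lloyd-style k-means on scalars (hypothetical helper)."""
    ordered = sorted(values)
    # Assumed initialisation: k values spread evenly over the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[assign(v, centers)].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers


# Stage 1: cluster on size only.
centers = kmeans_1d([size for _, size in blocks.values()], k=2)
size_cluster = {b: assign(size, centers) for b, (_, size) in blocks.items()}

# Stage 2: derive a new attribute, the stage-1 cluster of the tagged block.
tag_cluster = {b: size_cluster[tag] for b, (tag, _) in blocks.items()}

# Group blocks whose tags landed in the same stage-1 cluster.
groups = {}
for b, c in tag_cluster.items():
    groups.setdefault(c, []).append(b)

print(size_cluster)
print(tag_cluster)
print(groups)
```

On this toy data the size-based clusters and the tag-derived groups agree. On real data they may not, which is where the derived attribute would add information.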

Shehbaz Jaffer
    Are the clusters and classification labels the same? Would you please clarify (e.g. with an example)? – Ash Jul 04 '16 at 03:55

1 Answer


This objective does not make sense.

Your four blocks and tags form a cycle:

1 -> 4 -> 2 -> 3 -> 1
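The cycle can be checked by following each block's tag in turn. A minimal sketch, where the function name `follow_tags` is hypothetical and the tags are taken from the question's example:

```python
# Tags from the question: block number -> tagged block number.
tags = {1: 4, 2: 3, 3: 1, 4: 2}


def follow_tags(start, tags):
    """Follow tag links from start until a block repeats; return the path."""
    path = [start]
    current = tags[start]
    while current != start and current not in path:
        path.append(current)
        current = tags[current]
    path.append(current)  # close the loop with the repeated block
    return path


cycle = follow_tags(1, tags)
print(" -> ".join(map(str, cycle)))  # 1 -> 4 -> 2 -> 3 -> 1
```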

Why would it make sense to break this into two groups, 1+2 and 3+4?

k-means and other algorithms will not be of much help here. You first need to define formally what makes a solution good, and then find an algorithm that optimizes that property. k-means minimizes squared deviations. How is that going to help with your problem?

Has QUIT--Anony-Mousse
  • This is just a sample example; it forms a cycle only by coincidence. View it as a bipartite graph with (1,2) in one cluster and (3,4) in another, with edges going from one cluster to the other. k-means is useful because the size attribute can be used to classify blocks into clusters. The tag attribute, however, depends on a future event and cannot be used beforehand. Currently I first classify blocks into clusters using size, and then use the tags based on which cluster the tagged blocks belong to. This gives me decent results, but I want more precise results, since the tag attribute is the most distinguishing attribute in my dataset. – Shehbaz Jaffer Jul 09 '16 at 02:02