Is there any supervised clustering algorithm or a way to apply prior knowledge to your clustering?

Question

In my case, I have a dataset of letters and symbols detected in an image. Each detected item is represented by its coordinates, type (letter, number, etc.), value, and orientation, but not by its actual bounding box in the image. My goal is to use this dataset to group the items into "words", or contextual groups in general.

So far I have achieved OK-ish results with classic unsupervised clustering using the DBSCAN algorithm, but this relies too heavily on the geometric distance between samples, so the resulting groups do not resemble the "words" I am aiming for. I am therefore looking for a way to influence the results of the clustering algorithm using what I know about the "word-like" nature of the clusters I need.

One approach I considered is to create a dataset of true and false clusters and train an SVM (or any other classifier) to detect whether a proposed cluster is correct. However, I have no solid evidence that such a model can be trained well enough to discriminate between good and bad clusters, and I find it difficult to represent a cluster efficiently and consistently based on the features of its members. Moreover, since my "testing data" would be a very large set of all possible combinations of the letters and symbols I have, the whole approach seems too complicated to implement without some indication that it will work in the end.

To conclude, my question is whether anyone has prior experience with this kind of task (it sounds rather simple to me, but apparently it is not). Do you know of any supervised clustering algorithm and, if so, what is the proper way to represent clusters of data so that a model can be trained on them efficiently?

Any idea, suggestion, or even a hint about where I can research this further will be much appreciated.

score 2 · Answer 1 · answered Mar 16 '21 at 00:21

There are papers on supervised clustering. A nice, clear one is Eick et al., which is available for free. Unfortunately, I do not think any off-the-shelf libraries in Python support this. There is also this in the specific realm of text, but it is a much more domain-specific approach compared to Eick.

But there is a very simple solution that is effectively a type of supervised clustering. Decision trees essentially chop the feature space into high-purity regions, or at least attempt to. So, as a quick form of supervised clustering, you can do the following:

1. Create a decision tree using the label data.
2. Think of each leaf as a "cluster."

In sklearn, you can retrieve the leaf that each sample falls into by calling the fitted tree's apply() method.
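As a rough illustration of the leaves-as-clusters idea, here is a minimal sketch. The toy coordinates and "word" labels are made up for the example, and a real dataset would also need the type, value and orientation features encoded numerically.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (invented for illustration): x, y coordinates of detected symbols,
# each with a hand-assigned "word" label.
X = np.array([
    [0.0, 0.0], [1.0, 0.1], [2.0, 0.0],      # word 0
    [10.0, 0.2], [11.0, 0.0], [12.0, 0.1],   # word 1
    [0.5, 5.0], [1.5, 5.1],                  # word 2
])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2])

# Fit a tree on the labelled data; with no depth limit it grows until
# every leaf is pure on the training set.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# apply() returns, for each sample, the index of the leaf it ends up in.
leaf_ids = tree.apply(X)

# Treat every distinct leaf as one "cluster" of sample indices.
clusters = {}
for sample_idx, leaf in enumerate(leaf_ids):
    clusters.setdefault(int(leaf), []).append(sample_idx)

# Unseen samples are assigned to a cluster the same way.
new_leaf = int(tree.apply(np.array([[11.5, 0.0]]))[0])
```

In practice you would probably want to limit the tree (for example with max_depth or min_samples_leaf) so that the leaves stay coarse enough to be useful as clusters rather than memorising the training labels.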

score 0 · Answer 2 · answered Nov 28 '19 at 19:15

A standard approach would be to use the dendrogram from hierarchical clustering.

Then merge branches only if they agree with your positive examples and don't violate any of your negative examples.
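One way this could look in code is sketched below, in plain Python. It is only an interpretation of the idea, not a standard library routine: the function name, the single-linkage distance and the max_dist stopping threshold are all assumptions made for the example. Positive examples are treated as pairs that must end up together, negative examples as pairs that must never share a cluster.

```python
import math
from itertools import combinations


def constrained_agglomerative(points, must_link, cannot_link, max_dist):
    """Single-linkage merging that respects positive/negative example pairs.

    Hypothetical helper for illustration; names and parameters are assumptions.
    """
    clusters = [{i} for i in range(len(points))]
    forbidden = {frozenset(pair) for pair in cannot_link}

    # Positive examples: merge those pairs before any distance-based merging.
    for i, j in must_link:
        a = next(c for c in clusters if i in c)
        b = next(c for c in clusters if j in c)
        if a is not b:
            a |= b
            clusters = [c for c in clusters if c is not b]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    def violates(a, b):
        # A merge is blocked if it would join any negative-example pair.
        return any(frozenset((i, j)) in forbidden for i in a for j in b)

    while True:
        candidates = []
        for (ia, a), (ib, b) in combinations(enumerate(clusters), 2):
            if violates(a, b):
                continue
            d = linkage(a, b)
            if d <= max_dist:
                candidates.append((d, ia, ib))
        if not candidates:
            return clusters
        # Merge the closest allowed pair (ia < ib, so deleting ib is safe).
        _, ia, ib = min(candidates)
        clusters[ia] |= clusters[ib]
        del clusters[ib]


# Toy example: five symbols on a line, given by their (x, y) coordinates.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
groups = constrained_agglomerative(
    points,
    must_link=[(3, 4)],    # known to belong to the same word
    cannot_link=[(1, 2)],  # known to belong to different words
    max_dist=1.5,
)
```

On this toy input, distance alone with the same threshold would chain points 0 to 3 together and leave point 4 on its own; with the two example pairs the result is {0, 1} and {2, 3, 4}.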

Has QUIT--Anony-Mousse
