In my case I have a dataset of letters and symbols, detected in an image. The detected items are represented by their coordinates, type (letter, number etc), value, orientation and not the actual bounding box of the image. My goal is, using this dataset, to group them into different "words" or contextual groups in general.
So far I achieved ok-ish results by applying classic unsupervised clustering, using DBSCAN algorithm, but still this is way tοo limited on the geometric distance of the samples and so the resulting groups cannot resemble the "words" I am aiming for. So I am searching for a way to influence the results of the clustering algorithm by using the knowledge I have about the "word-like" nature of the clusters needed.
My possible approach that I thought was to create a dataset of true and false clusters and train an SVM model (or any classifier) to detect whether a proposed cluster is correct or not. But still for this, I have no solid proof that I can train a model well enough to discriminate between good and bad clusters, plus I find it difficult to efficiently and consistently represent the clusters, based on the features of their members. Moreover, since my "testing data" will be a big amount of all possible combinations of the letters and symbols I have, the whole approach seems a bit too complicated to attempt implementing it without any proof or indications that it's going to work in the end.
To conclude, my question is, if someone has any prior experience with that kind of task (in my mind sounds rather simple task, but apparently it is not). Do you know of any supervised clustering algorithm and if so, which is the proper way to represent clusters of data so that you can efficiently train a model with them?
Any idea/suggestion or even hint towards where I can research about it will be much appreciated.