I want to cluster PDF documents based on their structure, not only their text content.
The main problem with a text-only approach is that it loses the information about whether a document has a PDF form structure, is just a plain document, or contains pictures.
For our further processing this information is the most important part. My main goal is to classify a document mainly by its structure, not only by its text content.
The documents to classify are stored in a SQL database as byte[] (varbinary), so my idea is to use this raw data for classification, without prior text conversion.
When I look at the hex output of this data, I can see repeating structures that seem to correspond to the different document classes I want to separate. You can see some of these similar byte patterns as a first impression in my attached screenshot.
So my idea is to train a K-Means model on, for example, the hex output string. In the next step I would try to find the best number of clusters with the elbow method, which should be around 350 - 500.
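To make the plan a bit more concrete, this is roughly what I have in mind for turning the raw bytes into feature vectors (just a sketch with made-up dummy data; the byte-value histogram is only my current assumption, since K-Means needs numeric vectors rather than raw hex strings):

```python
import numpy as np

def byte_histogram(raw: bytes) -> np.ndarray:
    """Relative frequency of each byte value (0-255) -> fixed-length vector."""
    counts = np.bincount(np.frombuffer(raw, dtype=np.uint8), minlength=256)
    return counts / max(len(raw), 1)

# Dummy blobs; in my case these would be the varbinary values read from SQL.
raw_docs = [b"%PDF-1.7 ... plain text content ...",
            b"%PDF-1.4 /AcroForm ... form fields ..."]
features = np.vstack([byte_histogram(doc) for doc in raw_docs])
print(features.shape)  # (n_documents, 256)
```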
The size of the PDF data varies between 20 kB and 5 MB, mostly around 150 kB. To train the model I have more than 30,000 documents.
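And this is how I would run the elbow method on top of such features (again only a sketch, using scikit-learn's KMeans with random stand-in data and a shortened k range for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# `features` would be the (n_documents, 256) matrix from the sketch above;
# random data is used here only so the snippet runs on its own.
features = np.random.rand(1000, 256)

k_values = range(2, 21)           # the real search would go up to ~500
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.title("Elbow method")
plt.show()
```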
When I research this topic, the results are sparse. I have only found this article, which makes me unsure about the best way to solve my task: https://www.ibm.com/support/pages/clustering-binary-data-k-means-should-be-avoided
My questions are:
- Is K-Means the best algorithm for my goal?
- Which method would you recommend instead?
- How should I normalize or transform the data to get the best results?