
I want to cluster PDF documents based on their structure, not only their text content.

The main problem with the text-only approach is that it loses the information about whether a document has a PDF form structure, is just a plain document, or contains pictures.

For our further processing, this information is most important. My main goal is to be able to classify a document mainly by its structure, not only its text content.

The documents to classify are stored in an SQL database as byte[] (varbinary), so my idea is to use this raw data for classification, without prior text conversion.

When I look at the hex output of this data, I can see repeating structures which seem to correspond to the different document classes I want to separate. You can see some similar byte patterns as a first impression in my attached screenshot.

So my idea is to train a K-Means model with e.g. the hex output string. In the next step I would try to find the best number of clusters with the elbow method, which should be around 350-500.
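To make the idea concrete, here is a minimal sketch of the elbow method with scikit-learn. The `byte_histogram` helper is my own assumption for a normalization step: a hex string would have to be turned into a numeric vector anyway, so this uses a 256-bin byte-frequency histogram instead, and the `docs` list is a toy stand-in for documents loaded from the database:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical normalization: a 256-bin byte-frequency histogram,
# scaled by document length so file size does not dominate.
def byte_histogram(raw: bytes) -> np.ndarray:
    counts = np.bincount(np.frombuffer(raw, dtype=np.uint8), minlength=256)
    return counts / max(len(raw), 1)

# Toy stand-in for PDFs loaded from the database as byte[].
docs = [b"%PDF-1.4 form form form",
        b"%PDF-1.7 image stream",
        b"%PDF-1.4 plain text"] * 10
X = np.vstack([byte_histogram(d) for d in docs])

# Elbow method: compute the inertia (within-cluster sum of squares)
# for increasing k and look for the "bend" in the curve.
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```

In a real run, one would plot inertia against k and pick the k where the curve flattens out.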

The size of the PDF data varies between 20 kB and 5 MB, mostly around 150 kB. To train the model I have more than 30,000 documents.

When I research this, the results are sparse. I only found this article, which makes me unsure about the best way to solve my task: https://www.ibm.com/support/pages/clustering-binary-data-k-means-should-be-avoided

My questions are:

  • Is K-Means the best algorithm for my goal?
  • Which method would you recommend?
  • How should the data be normalized or transformed for the best results?

[Screenshot: raw data]

Using raw binary data to classify something that is a well-structured graph (tree) sounds like a bad idea. Why can't you parse it as a PDF, extract the document structure, create features from that, and then run k-NN or another algorithm over them? You are throwing away so much valuable information by trying to work from the binary. – Ian Mercer Feb 10 '21 at 23:03
  • See for example: https://stackoverflow.com/questions/4422129/clustering-tree-structured-data or https://intellipaat.com/community/2549/clustering-tree-structured-data – Ian Mercer Feb 10 '21 at 23:04

1 Answer


As Ian said in the comments, using the raw data seems like a bad idea.

With further research I found it best to first read the structure of the PDF file, e.g. with an approach like this:

https://github.com/Uzi-Granot/PdfFileAnaylyzer

I normalized and clustered the data based on this structural information, which gives me good results.
