I do have 20.000 text files loaded in PostgreSQL database, one file in one row, all stored in table named docs
with columns doc_id
and doc_content
.
I know that there is approximately 8 types of documents. Here are my questions:
- How can I find these groups?
- Are there some similarity, dissimilarity measures I can use?
- Is there some implementation of longest common substring in PostgreSQL?
- Are there some extensions for text mining in PostgreSQL? (I've found only Tsearch, but this seems to be last updated in 2007)
I can probably use some like '%%'
or SIMILAR TO
, but there might be better approach.