Cluster text documents in database

Question

I do have 20.000 text files loaded in PostgreSQL database, one file in one row, all stored in table named docs with columns doc_id and doc_content.

I know that there is approximately 8 types of documents. Here are my questions:

How can I find these groups?
Are there some similarity, dissimilarity measures I can use?
Is there some implementation of longest common substring in PostgreSQL?
Are there some extensions for text mining in PostgreSQL? (I've found only Tsearch, but this seems to be last updated in 2007)

I can probably use some like '%%' or SIMILAR TO, but there might be better approach.

score 1 · Accepted Answer · answered Apr 04 '13 at 08:13

1

You should use full text search, which is part of PostgreSQL 9.x core (aka Tsearch2).

For some kind of measure of longest common substring (or similarity if you will), you might be able to use levenshtein() function - part of fuzzystrmatch extension.

answered Apr 04 '13 at 08:13

mvp

111,019
13
122
148

score 0 · Answer 2 · answered Apr 06 '13 at 17:27

You can use a clustering technique such as K-Means or Hierarchical Clustering.
Yes you can use the Cosine similarity between documents, looking at the binary term count, term counts, term frequencies, or TF-IDF frequencies.
I don't know about that one.
Not sure, but you could use R or RapidMiner to do the data mining against your database.

Cluster text documents in database

2 Answers2