
Imagine you are a librarian, and over time you have classified a number of text files (approximately 100) under a single general, ambiguous keyword.

Each text file actually belongs to one of two topics: keyword_meaning1 or keyword_meaning2.

Which unsupervised learning approach would you use to split the text files into two groups?

What classification precision (as a percentage) can be achieved, given the number of text files?

Also, can certain files in a group somehow be flagged as needing a librarian's review, because they may have been classified incorrectly?

xralf

1 Answer


The easiest starting point would be to use a naive Bayes classifier. It's hard to speculate about the expected precision; you have to test it yourself. Just get a program for e-mail spam detection and try it out. For example, SpamBayes (http://spambayes.sourceforge.net/) is quite a good starting point and easily hackable. SpamBayes has a nice feature: it will label messages as "unsure" when there is no clear separation between the two classes.
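To make the idea concrete, here is a minimal sketch of a word-count naive Bayes with an "unsure" band (this is not SpamBayes itself; the toy training data, labels `meaning1`/`meaning2`, and the `unsure_margin` threshold are all illustrative assumptions):

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs. Returns per-class word counts and doc counts."""
    counts = {}          # label -> Counter of words
    totals = Counter()   # label -> number of documents
    for text, label in labeled_docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals, unsure_margin=0.5):
    """Return the more likely label, or 'unsure' when the log-odds gap is below the margin."""
    words = text.lower().split()
    vocab = set().union(*counts.values())
    n_docs = sum(totals.values())
    scores = {}
    for label, wc in counts.items():
        total_words = sum(wc.values())
        score = math.log(totals[label] / n_docs)  # class prior
        for w in words:
            # Laplace smoothing so unseen words don't zero out the probability
            score += math.log((wc[w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    best, second = sorted(scores, key=scores.get, reverse=True)[:2]
    if scores[best] - scores[second] < unsure_margin:
        return "unsure"  # no clear separation -> a librarian should check this file
    return best

# Hypothetical labeled examples standing in for the librarian's two keyword meanings
docs = [("jazz music record", "meaning1"), ("music concert hall", "meaning1"),
        ("court law record", "meaning2"), ("law judge verdict", "meaning2")]
counts, totals = train(docs)
```

Anything landing in the "unsure" band is exactly the set of files the question asks to route back to the librarian.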

Edit: If you really want an unsupervised clustering method, then something like Carrot2 (http://project.carrot2.org/) may be more appropriate.
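For the fully unsupervised route, a generic sketch (not Carrot2's algorithm) is two-means clustering over bag-of-words vectors, flagging documents whose similarity to the two centroids is nearly equal; the seeding strategy and the 0.1 flagging threshold are arbitrary assumptions:

```python
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def kmeans2(docs, iters=10):
    """Split docs into two clusters; also flag documents near the cluster boundary."""
    vecs = [vectorize(d) for d in docs]
    centroids = [vecs[0], vecs[-1]]  # naive seeding: first and last document
    for _ in range(iters):
        clusters = [[], []]
        for v in vecs:
            sims = [cosine(v, c) for c in centroids]
            clusters[sims.index(max(sims))].append(v)
        # recompute each centroid as the summed word counts of its cluster
        centroids = [sum(cl, Counter()) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    labels, flags = [], []
    for v in vecs:
        s0, s1 = cosine(v, centroids[0]), cosine(v, centroids[1])
        labels.append(0 if s0 >= s1 else 1)
        flags.append(abs(s0 - s1) < 0.1)  # ambiguous -> librarian should review
    return labels, flags
```

The flags play the same role as SpamBayes' "unsure" label: they mark the files where the split between the two groups is not clear.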

JooMing