
Imagine you are a librarian, and over time you have classified a number of text files (approximately 100) under a single general, ambiguous keyword.

Each text file actually belongs to one of two topics: keyword_meaning1 or keyword_meaning2.

Which unsupervised learning approach would you use to split the text files into two groups?

What classification precision (as a percentage) can be achieved, given the number of text files?

Also, can certain files in a group somehow be flagged as needing a librarian's review, because they may have been classified incorrectly?

xralf

1 Answer


The easiest starting point would be to use a naive Bayes classifier. It's hard to speculate about the expected precision; you have to test it yourself. Just get a program for e-mail spam detection and try it out. For example, SpamBayes (http://spambayes.sourceforge.net/) is quite a good starting point and easily hackable. SpamBayes has a nice feature: it will label messages as "unsure" when there is no clear separation between the two classes.
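To make the idea concrete, here is a minimal sketch of a word-count naive Bayes with an "unsure" band (this is not SpamBayes itself; the toy training data, labels `meaning1`/`meaning2`, and the `unsure_margin` threshold are all illustrative assumptions):

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs. Returns per-class word counts and doc counts."""
    counts = {}          # label -> Counter of words
    totals = Counter()   # label -> number of documents
    for text, label in labeled_docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals, unsure_margin=0.5):
    """Return the more likely label, or 'unsure' when the log-odds gap is below the margin."""
    words = text.lower().split()
    vocab = set().union(*counts.values())
    n_docs = sum(totals.values())
    scores = {}
    for label, wc in counts.items():
        total_words = sum(wc.values())
        score = math.log(totals[label] / n_docs)  # class prior
        for w in words:
            # Laplace smoothing so unseen words don't zero out the probability
            score += math.log((wc[w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    best, second = sorted(scores, key=scores.get, reverse=True)[:2]
    if scores[best] - scores[second] < unsure_margin:
        return "unsure"  # no clear separation -> a librarian should check this file
    return best

# Hypothetical labeled examples standing in for the librarian's two keyword meanings
docs = [("jazz music record", "meaning1"), ("music concert hall", "meaning1"),
        ("court law record", "meaning2"), ("law judge verdict", "meaning2")]
counts, totals = train(docs)
```

Anything landing in the "unsure" band is exactly the set of files the question asks to route back to the librarian.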

Edit: If you really want an unsupervised clustering method, then something like Carrot2 (http://project.carrot2.org/) may be more appropriate.
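For the fully unsupervised route, a generic sketch (not Carrot2's algorithm) is two-means clustering over bag-of-words vectors, flagging documents whose similarity to the two centroids is nearly equal; the seeding strategy and the 0.1 flagging threshold are arbitrary assumptions:

```python
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def kmeans2(docs, iters=10):
    """Split docs into two clusters; also flag documents near the cluster boundary."""
    vecs = [vectorize(d) for d in docs]
    centroids = [vecs[0], vecs[-1]]  # naive seeding: first and last document
    for _ in range(iters):
        clusters = [[], []]
        for v in vecs:
            sims = [cosine(v, c) for c in centroids]
            clusters[sims.index(max(sims))].append(v)
        # recompute each centroid as the summed word counts of its cluster
        centroids = [sum(cl, Counter()) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    labels, flags = [], []
    for v in vecs:
        s0, s1 = cosine(v, centroids[0]), cosine(v, centroids[1])
        labels.append(0 if s0 >= s1 else 1)
        flags.append(abs(s0 - s1) < 0.1)  # ambiguous -> librarian should review
    return labels, flags
```

The flags play the same role as SpamBayes' "unsure" label: they mark the files where the split between the two groups is not clear.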

JooMing