0

I want to analyse a document for items such as letters, bigrams, words, etc., and compare how frequent they are in my document with how frequent they are across a large corpus of documents.

The idea is that words such as "if", "and", "the" are common in all documents but some words will be much more common in this document than is typical for the corpus.

This must be pretty standard. What is it called? When I do it the obvious way, I always run into a problem: novel words that appear in my document but not in the corpus are rated as infinitely significant. How is this dealt with?

hippietrail
  • Can you expand on which metrics of your texts you need? – matcheek Dec 07 '10 at 02:14
  • @matcheek: Most of the docs I can find are about finding the document that best matches a search for one or more words, but I'm most interested in finding the "most interesting" words/phrases/ngrams in a document. Something like Amazon's "statistically improbable phrases". – hippietrail Dec 08 '10 at 00:14

2 Answers

1

It comes under the heading of linear classifiers, with Naive Bayesian classifiers being the best-known form (due to their remarkable simplicity and robustness in attacking real-world classification problems).

Marcelo Cantos
  • I did a lot of reading on "Naive Bayesian classifiers" after reading your answer and found the area fascinating. But I couldn't see the direct connection to my problem which seemed to be better covered by "tf-idf". – hippietrail Apr 30 '11 at 21:40
1

Most likely you've already checked tf-idf or some other metric from the Okapi BM25 family.
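For illustration, here is a minimal sketch of tf-idf scoring for one document against a corpus. The function name `tf_idf` and the token-list inputs are assumptions made for this example, and the idf formula is one common smoothed variant, not the only definition in use.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus_docs):
    """Score each term in doc_tokens against corpus_docs (a list of token lists).

    Hypothetical helper for illustration only.
    """
    n_docs = len(corpus_docs)

    # Document frequency: in how many corpus documents does each term occur?
    df = Counter()
    for tokens in corpus_docs:
        df.update(set(tokens))

    tf = Counter(doc_tokens)
    total = len(doc_tokens)

    scores = {}
    for term, count in tf.items():
        # Smoothed idf: adding 1 to both counts keeps it finite for unseen terms.
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1
        scores[term] = (count / total) * idf
    return scores

corpus = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
doc = ["the", "the", "quark"]
scores = tf_idf(doc, corpus)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, round(scores[term], 3))
```

In this toy input, "quark" outranks "the" even though "the" occurs twice, because "the" appears in every corpus document.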

You can also check the Natural Language Toolkit (NLTK) for some ready-made solutions.

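UPDATE: As for novel words, smoothing should be applied: Good-Turing, Laplace, etc. Smoothing gives words unseen in the corpus a small nonzero probability, so the document-to-corpus ratio stays finite.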
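As a rough sketch of how Laplace (add-alpha) smoothing addresses the novel-word problem, the following compares relative frequencies in a document and a corpus. The function name `smoothed_log_ratio` and the `alpha` parameter are assumptions for this example; Good-Turing would replace the add-alpha step with a different estimate.

```python
import math
from collections import Counter

def smoothed_log_ratio(doc_tokens, corpus_tokens, alpha=1.0):
    """Log ratio of smoothed document frequency to smoothed corpus frequency.

    Hypothetical helper for illustration only.
    """
    doc_counts = Counter(doc_tokens)
    corpus_counts = Counter(corpus_tokens)

    # Shared vocabulary, so both distributions are smoothed over the same items.
    vocab_size = len(set(doc_counts) | set(corpus_counts))
    doc_total = sum(doc_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for word in doc_counts:
        # Add alpha to every count; an unseen corpus word gets alpha instead of 0.
        p_doc = (doc_counts[word] + alpha) / (doc_total + alpha * vocab_size)
        p_corpus = (corpus_counts[word] + alpha) / (corpus_total + alpha * vocab_size)
        scores[word] = math.log(p_doc / p_corpus)
    return scores

doc = "the cat sat on the zorgblat".split()
corpus = "the dog sat on the mat and the cat ran".split()
scores = smoothed_log_ratio(doc, corpus)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 3))
```

The novel word still ranks highest here, but its score is finite rather than infinite, and a common word like "the" scores below zero.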

matcheek
  • I'm accepting your answer because tf-idf covered most of what I was looking for, even though I couldn't really work out how your suggestions for smoothing applied. Maybe that's because I lacked the terminology to state my question more clearly. – hippietrail Apr 30 '11 at 21:38