Comparing text frequencies in a document to frequency in a corpus

Question

I want to analyse a document for items such as letters, bigrams, words, etc. and compare how frequent they are in my document to how frequent they are across a large corpus of documents.

The idea is that words such as "if", "and", "the" are common in all documents but some words will be much more common in this document than is typical for the corpus.

This must be pretty standard. What is it called? When I do it the obvious way, novel words that appear in my document but not in the corpus always end up rated as infinitely significant. How is this dealt with?

@matcheek: Most of the docs I can find are about finding the document that best matches a search for one or more words, but I'm most interested in finding the "most interesting" words/phrases/ngrams in a document. Something like Amazon's "statistically improbable phrases". – hippietrail, Dec 08 '10 at 00:14

score 1 · Answer 1 · answered Dec 07 '10 at 01:55

It comes under the heading of linear classifiers, with Naive Bayesian classifiers being the most well-known form (due to their remarkable simplicity and robustness in attacking real-world classification problems).

Marcelo Cantos
I did a lot of reading on "Naive Bayesian classifiers" after reading your answer and found the area fascinating. But I couldn't see the direct connection to my problem which seemed to be better covered by "tf-idf". – hippietrail Apr 30 '11 at 21:40

matcheek · Accepted Answer · 2010-12-07T02:20:42.010

Most likely you've already checked tf-idf or some other metric from the Okapi BM25 family.
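To make the tf-idf idea concrete, here is a minimal sketch in plain Python. The function name `tf_idf`, the token-list inputs, and the particular smoothed idf formula are assumptions for illustration; several idf variants exist and this is only one common choice.

```python
import math
from collections import Counter

def tf_idf(document, corpus):
    """Score each word of `document` (a list of tokens) against
    `corpus` (a list of token lists). Hypothetical helper."""
    term_counts = Counter(document)
    corpus_sets = [set(doc) for doc in corpus]
    n_docs = len(corpus_sets)
    scores = {}
    for word, count in term_counts.items():
        # Term frequency: share of the document taken up by this word.
        tf = count / len(document)
        # Document frequency: how many corpus documents contain the word.
        df = sum(1 for doc in corpus_sets if word in doc)
        # Smoothed idf (an assumed variant): the +1 terms avoid a
        # division by zero for words that never occur in the corpus.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = tf * idf
    return scores
```

With a toy corpus where "the" occurs in every document, a rarer word such as "zebra" should outrank it even though "the" is more frequent in the document itself.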

Also, you can check the Natural Language Toolkit (NLTK) for some ready-made solutions.

UPDATE: As for novel words, smoothing should be applied (Good-Turing, Laplace, etc.), so that a word unseen in the corpus gets a small non-zero probability rather than zero.
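As a rough illustration of how smoothing handles novel words, here is a sketch using Laplace (add-alpha) smoothing in plain Python. The function name `smoothed_log_ratio`, the `alpha` parameter, and the choice of a log-ratio as the score are assumptions for the example; Good-Turing would need a different estimate.

```python
import math
from collections import Counter

def smoothed_log_ratio(document, corpus_tokens, alpha=1.0):
    """Log of (smoothed document probability / smoothed corpus
    probability) for each word in `document`. Hypothetical helper."""
    doc_counts = Counter(document)
    corpus_counts = Counter(corpus_tokens)
    # Shared vocabulary, so both distributions are smoothed over the same set.
    vocab_size = len(set(doc_counts) | set(corpus_counts))
    doc_total = len(document)
    corpus_total = len(corpus_tokens)
    scores = {}
    for word in doc_counts:
        # Add-alpha smoothing: every word gets alpha pseudo-counts, so a
        # word missing from the corpus has a small but non-zero probability.
        p_doc = (doc_counts[word] + alpha) / (doc_total + alpha * vocab_size)
        p_corpus = (corpus_counts[word] + alpha) / (corpus_total + alpha * vocab_size)
        scores[word] = math.log(p_doc / p_corpus)
    return scores
```

Without the pseudo-counts, a word that is absent from the corpus would have a corpus probability of zero and the ratio would blow up, which is the "infinitely significant" problem from the question. With them, the word still scores high but stays finite.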

edited Dec 07 '10 at 02:20

answered Dec 07 '10 at 02:02


I'm accepting your answer because tf-idf covered most of what I was looking for, even though I couldn't really work out how your suggestions for smoothing applied. Maybe that's because I lacked the terminology to state my question more clearly. – hippietrail Apr 30 '11 at 21:38
