Several times over the years I have wanted to work with frequency lists (character, word, n-gram, etc.) of varying quality, but I have never figured out how to use them together.
At the time I intuited that lists with only rank and no other data should still be useful. Since then I have learned about Zipf's law and power laws, though I'm not great at maths, so I don't fully understand them.
I've found some questions on StackOverflow and CrossValidated that seem like they could be related, but I either don't understand them at the right level or they lack useful answers.
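As far as I understand it, Zipf's law says the frequency of the item at rank r is roughly proportional to 1/r^s for some exponent s near 1, i.e. something like

    f(r) ≈ C / r^s

but I'm not sure whether that is actually the right thing to lean on here.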
What I want is a way to normalize a list with full frequency data and a list with only rank data so that I can use the two lists together.
For instance, a word list with frequency data:
word    per million
的 50155.13
我 50147.83
你 39629.27
是 28253.52
了 28210.53
不 20543.44
在 12811.05
他 11853.78
我们 11080.02
...
...
... 00000.01
And a word list with only rank data:
word rank
的 1
一 2
是 3
有 4
在 5
人 6
不 7
大 8
中 9
...
...
... 100,000
How can I normalize both the frequency data and the rank data into the same kind of value, one that can then be used for comparisons and so on?
(The lists above are just short examples. Assume much longer lists obtained from external sources over which the programmer has no control.)
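In case it helps to show what I mean, here is a rough Python sketch of the only idea I've had so far: if Zipf's law roughly holds, a rank can be turned into an estimated per-million frequency, and then both lists can be put on a common log scale. The exponent s = 1 and the scale constant C are just guesses on my part, and I have no idea whether any of this is statistically sound:

```python
import math

# Toy versions of the two lists, in the same shape as my real data.
freq_list = {"的": 50155.13, "我": 50147.83, "你": 39629.27, "是": 28253.52}
rank_list = {"的": 1, "一": 2, "是": 3, "有": 4}

S = 1.0        # Zipf exponent: just a guess
C = 50155.13   # scale constant: borrowed from the rank-1 per-million value in the other list

def rank_to_per_million(rank, s=S, c=C):
    """Estimate a per-million frequency from a rank, assuming Zipf's law."""
    return c / (rank ** s)

def to_score(per_million):
    """Put a per-million frequency on a log scale so both lists share one kind of value."""
    return math.log10(per_million)

freq_scores = {w: to_score(f) for w, f in freq_list.items()}
rank_scores = {w: to_score(rank_to_per_million(r)) for w, r in rank_list.items()}

# Are these now comparable? e.g. 是 appears in both lists.
print(freq_scores["是"], rank_scores["是"])
```

Is something along these lines reasonable, or is there a standard way to do this kind of normalization?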