Several times over the years I have wanted to work with frequency lists (character, word, n-gram, etc.) of varying quality, but I have never figured out how to use them together.
At the time I intuited that lists with only rank and no other data should still be useful. Since then I have learned about Zipf's law and power laws, though I'm not great at maths, so I don't fully understand them.
I've found some questions on StackOverflow and CrossValidated that seem like they could be related, but I either don't understand them at the right level or they lack useful answers.
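As far as I understand it, Zipf's law says the frequency of the item at rank r is roughly proportional to 1/r^s for some exponent s near 1, i.e. something like

    f(r) ≈ C / r^s

but I'm not sure whether that is actually the right thing to lean on here.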
What I want is a way to normalize a list with full frequency data and a list with only rank data so that I can use the two lists together.
For instance, a word list with frequency data:
word    per million
的 50155.13
我 50147.83
你 39629.27
是 28253.52
了 28210.53
不 20543.44
在 12811.05
他 11853.78
我们 11080.02
...
...
... 00000.01
And a word list with only rank data:
word rank
的 1
一 2
是 3
有 4
在 5
人 6
不 7
大 8
中 9
...
...
... 100,000
How can I normalize both the frequency data and the rank data into the same kind of value, one that can then be used for comparisons and so on?
(The lists above are just short examples. Assume much longer lists obtained from external sources over which the programmer has no control.)
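In case it helps to show what I mean, here is a rough Python sketch of the only idea I've had so far: if Zipf's law roughly holds, a rank can be turned into an estimated per-million frequency, and then both lists can be put on a common log scale. The exponent s = 1 and the scale constant C are just guesses on my part, and I have no idea whether any of this is statistically sound:

```python
import math

# Toy versions of the two lists, in the same shape as my real data.
freq_list = {"的": 50155.13, "我": 50147.83, "你": 39629.27, "是": 28253.52}
rank_list = {"的": 1, "一": 2, "是": 3, "有": 4}

S = 1.0        # Zipf exponent: just a guess
C = 50155.13   # scale constant: borrowed from the rank-1 per-million value in the other list

def rank_to_per_million(rank, s=S, c=C):
    """Estimate a per-million frequency from a rank, assuming Zipf's law."""
    return c / (rank ** s)

def to_score(per_million):
    """Put a per-million frequency on a log scale so both lists share one kind of value."""
    return math.log10(per_million)

freq_scores = {w: to_score(f) for w, f in freq_list.items()}
rank_scores = {w: to_score(rank_to_per_million(r)) for w, r in rank_list.items()}

# Are these now comparable? e.g. 是 appears in both lists.
print(freq_scores["是"], rank_scores["是"])
```

Is something along these lines reasonable, or is there a standard way to do this kind of normalization?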