Is there a formula to estimate the average or worst-case size of an inverted index built over a text document? For example, given the following inputs:

  • File size: 60MB
  • Number of words: 7M
  • Number of unique words: ?

If it matters, I'm looking to test this out in Python, so the data structure (while in memory) would probably be a dict. How would one go about estimating the index size other than by trial and error?
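For the number of unique words, a standard back-of-the-envelope formula is Heaps' law: V ≈ K·nᵝ for n total tokens, with K typically in the range 10–100 and β typically 0.4–0.6 for English text. A minimal sketch (the constants below are illustrative assumptions that you would calibrate against a sample of the actual corpus):

```python
def heaps_estimate(n_tokens, K=44.0, beta=0.49):
    """Estimate vocabulary size via Heaps' law: V = K * n**beta.

    K and beta are typical values quoted for English text; they are
    assumptions, not measurements of any particular corpus.
    """
    return int(K * n_tokens ** beta)

print(heaps_estimate(7_000_000))  # on the order of 100,000 with these constants
```

With these constants, a 7M-token corpus works out to roughly 10⁵ unique words; the worst case is trivially all 7M tokens being unique.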

David542
  • Not sure what you mean by index size, but to find the number of unique words you could use a map or trie data structure. – nice_dev Aug 19 '19 at 07:08
  • Note that a Python dict grows its underlying table geometrically as it fills, so it needs more storage than a plain list. – Jainil Patel Aug 19 '19 at 07:30
  • Assume the worst case: 7M unique words. – Jainil Patel Aug 19 '19 at 07:31
  • Or use the nltk library to count. – Jainil Patel Aug 19 '19 at 07:32
  • To figure this out, you need to know all the details of the specific data structure that will be used. Real implementations use highly compressed structures that end up significantly smaller than the original text. – Matt Timmermans Aug 19 '19 at 12:59
  • Well, the worst case is easy: 7M unique words. But a program to figure out the number of unique words is trivial. You can probably find an existing Python script to do it. If it's a text file, check out https://stackoverflow.com/questions/15501652/how-split-a-file-in-words-in-unix-command-line. Take the output of that, then `sort -u`, and count the lines (a pure-Python version is sketched after this list). – Jim Mischel Aug 19 '19 at 17:15
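Following the comments above, here is a minimal Python sketch (the file name and the tokenizer are assumptions) that counts unique words, builds a toy positional index with a dict, and shows where a memory measurement would hook in:

```python
import re
import sys
from collections import defaultdict

def build_index(path):
    # word -> list of token positions; a real index would store doc IDs
    # and use compressed postings lists, which are far smaller than this.
    index = defaultdict(list)
    position = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumed tokenizer: lowercase, keep runs of letters/apostrophes.
            for word in re.findall(r"[a-z']+", line.lower()):
                index[word].append(position)
                position += 1
    return index

index = build_index("corpus.txt")  # hypothetical 60MB input file
print("unique words:", len(index))
print("total postings:", sum(len(p) for p in index.values()))
# sys.getsizeof counts only the dict's own hash table, not the keys or
# the postings lists; a recursive sizer (e.g. pympler.asizeof) gives
# the full in-memory footprint.
print("dict table bytes:", sys.getsizeof(index))
```

As Matt Timmermans points out, this dict-of-lists layout is the upper end of the estimate; production indexes compress postings and can end up smaller than the original text.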

0 Answers