Is there a formula to estimate the average or worst-case size of an inverted index built over a text document? For example, given the following inputs:

  • File size: 60MB
  • Number of words: 7M
  • Number of unique words: ?

If it matters, I'm looking to test this out in Python, so the data structure (while in memory) would probably be a dict. How would one go about estimating the index size other than by trial and error?
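For the number of unique words, a standard back-of-the-envelope formula is Heaps' law: V ≈ K·nᵝ for n total tokens, with K typically in the range 10–100 and β typically 0.4–0.6 for English text. A minimal sketch (the constants below are illustrative assumptions that you would calibrate against a sample of the actual corpus):

```python
def heaps_estimate(n_tokens, K=44.0, beta=0.49):
    """Estimate vocabulary size via Heaps' law: V = K * n**beta.

    K and beta are typical values quoted for English text; they are
    assumptions, not measurements of any particular corpus.
    """
    return int(K * n_tokens ** beta)

print(heaps_estimate(7_000_000))  # on the order of 100,000 with these constants
```

With these constants, a 7M-token corpus works out to roughly 10⁵ unique words; the worst case is trivially all 7M tokens being unique.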

David542
  • Not sure what you mean by index size, but to find the number of unique words you could use a map or trie data structure. – nice_dev Aug 19 '19 at 07:08
  • Note that a Python dict grows its underlying table geometrically as it fills, so it needs more storage than a plain list. – Jainil Patel Aug 19 '19 at 07:30
  • Assume the worst case: 7M unique words. – Jainil Patel Aug 19 '19 at 07:31
  • Or use the nltk library to count. – Jainil Patel Aug 19 '19 at 07:32
  • To figure this out, you need to know all the details of the specific data structure that will be used. Real implementations use highly compressed structures that end up significantly smaller than the original text. – Matt Timmermans Aug 19 '19 at 12:59
  • Well, the worst case is easy: 7M unique words. But a program to figure out the number of unique words is trivial. You can probably find an existing Python script to do it. If it's a text file, check out https://stackoverflow.com/questions/15501652/how-split-a-file-in-words-in-unix-command-line. Take the output of that, then `sort -u`, and count the lines (a pure-Python version is sketched after this list). – Jim Mischel Aug 19 '19 at 17:15
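Following the comments above, here is a minimal Python sketch (the file name and the tokenizer are assumptions) that counts unique words, builds a toy positional index with a dict, and shows where a memory measurement would hook in:

```python
import re
import sys
from collections import defaultdict

def build_index(path):
    # word -> list of token positions; a real index would store doc IDs
    # and use compressed postings lists, which are far smaller than this.
    index = defaultdict(list)
    position = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumed tokenizer: lowercase, keep runs of letters/apostrophes.
            for word in re.findall(r"[a-z']+", line.lower()):
                index[word].append(position)
                position += 1
    return index

index = build_index("corpus.txt")  # hypothetical 60MB input file
print("unique words:", len(index))
print("total postings:", sum(len(p) for p in index.values()))
# sys.getsizeof counts only the dict's own hash table, not the keys or
# the postings lists; a recursive sizer (e.g. pympler.asizeof) gives
# the full in-memory footprint.
print("dict table bytes:", sys.getsizeof(index))
```

As Matt Timmermans points out, this dict-of-lists layout is the upper end of the estimate; production indexes compress postings and can end up smaller than the original text.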

0 Answers