Is there a formula to estimate the average or worst-case size of an inverted index built on a text document? For example, if we have the following inputs:
- File size: 60MB
- Number of words: 7M
- Number of unique words: ?
If it matters, I'm looking to test this out in Python, so the data structure (while in memory) would probably be a dict. How would one go about estimating the index size other than by trial and error?
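
For reference, the trial-and-error approach I'd like to avoid would look roughly like the sketch below. It is only an illustration under assumptions of mine: whitespace tokenization, lowercasing, and a dict mapping each word to a list of word positions. The names `build_inverted_index` and `approx_index_size` are made up for this example. The size figure is approximate, since `sys.getsizeof` reports only an object's own size (so the parts are summed by hand) and CPython shares small integers between lists.

```python
import sys
from collections import defaultdict


def build_inverted_index(text):
    """Map each lowercased, whitespace-delimited word to the list of its positions."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return dict(index)


def approx_index_size(index):
    """Rough in-memory size in bytes: the dict, its keys, the posting lists, and their ints."""
    total = sys.getsizeof(index)
    for word, postings in index.items():
        total += sys.getsizeof(word) + sys.getsizeof(postings)
        # Over-counts slightly: CPython caches small ints, so some are shared.
        total += sum(sys.getsizeof(p) for p in postings)
    return total


# Tiny made-up sample; on the real file this would be run on a slice and scaled up.
sample = "the quick brown fox jumps over the lazy dog the end"
index = build_inverted_index(sample)
print(len(index), "unique words,", approx_index_size(index), "bytes (approx.)")
```

Running this on a sample of the file and extrapolating is exactly the kind of guesswork I'd rather replace with a formula, especially since the number of unique words does not grow linearly with the file size.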