I have a small program that uses NLTK to compute the frequency distribution of a rather large dataset. The problem is that after a few million words it starts to eat up all the RAM on my system. Here are what I believe to be the relevant lines of code:
freq_distribution = nltk.FreqDist(filtered_words)  # frequency distribution of all the words
top_words = freq_distribution.most_common(10)  # the most used words, as (word, count) pairs
bottom_words = freq_distribution.most_common()[-10:]  # the least used words
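For reference, here is a toy, self-contained reproduction of my setup using only the standard library. As far as I understand, `nltk.FreqDist` is a subclass of `collections.Counter`, so the memory behaviour should be comparable (the word list here is obviously just a stand-in for my real data):

```python
from collections import Counter

# Toy stand-in for my real filtered_words (a few million tokens in practice).
filtered_words = ["the", "cat", "sat", "on", "the", "mat", "the"]

# Counter mirrors FreqDist's counting behaviour and most_common() API.
freq_distribution = Counter(filtered_words)
top_words = freq_distribution.most_common(10)      # most used words
bottom_words = freq_distribution.most_common()[-10:]  # least used words

print(top_words[0])  # the single most frequent word with its count
```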
There must be a way to write the key/value store to disk; I'm just not sure how. I'd like to stay away from a document store like MongoDB and keep things purely Pythonic. Any suggestions would be appreciated.