I have a pickled Python object that produces a 180 MB file. When I unpickle it, memory usage explodes to 2 or 3 GB. Has anyone had a similar experience? Is this normal?
The object is a tree containing a dictionary: each edge is a letter, and each node is a potential word. Storing a word therefore takes as many edges as the word has letters. The first level has at most 26 nodes, the second at most 26^2, the third at most 26^3, and so on. Each node that is a word has an attribute pointing to the information about that word (verb, noun, definition, etc.).
My words are at most about 40 characters long, and I have around half a million entries. Everything goes fine until I pickle the tree (using a simple cPickle dump): that gives a 180 MB file. I am on Mac OS, and when I unpickle these 180 MB, the OS gives 2 or 3 GB of "memory / virtual memory" to the Python process :(
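To put a number on the gap between file size and in-memory size, the round trip can be measured roughly as follows. This is only a sketch under assumptions: it uses Python 3's `pickle` (the counterpart of `cPickle`) and `tracemalloc`, a toy nested-dict tree stands in for the real one, and `build_nested` and `measure_round_trip` are hypothetical names. `tracemalloc` only sees allocations made by Python, so its figure will not match what the OS reports for the process.

```python
import pickle
import tracemalloc


def build_nested(words):
    """Build a toy nested-dict trie; a stand-in for the real tree."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = [word]  # "$" marks the end of a word in this toy layout
    return root


def measure_round_trip(obj):
    """Return (pickled size in bytes, peak bytes allocated while unpickling)."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    tracemalloc.start()
    restored = pickle.loads(data)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert restored == obj  # sanity check: the round trip is lossless
    return len(data), peak


if __name__ == "__main__":
    words = ["word%05d" % i for i in range(20000)]
    size, peak = measure_round_trip(build_nested(words))
    print("pickled: %d bytes, peak while loading: %d bytes" % (size, peak))
```

Comparing the two numbers on a small sample gives a first idea of how much the structure grows once it is rebuilt as live Python objects.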
I don't see any recursion in this tree: the edges lead to nodes, which themselves hold an array of arrays. No recursion involved.
I am a bit stuck: loading these 180 MB takes around 20 seconds (not to mention the memory issue). I have to say my CPU is not that fast: a Core i5 at 1.3 GHz. But my hard drive is an SSD. I only have 4 GB of memory.
To add these 500,000 words to my tree, I read about 7,000 files, each containing about 100 words. This reading makes the memory allocated by Mac OS go up to 15 GB, mainly virtual memory :( I have been using the "with" statement to make sure each file is closed, but it doesn't really help. Reading one 40 KB file takes around 0.2 sec, which seems quite long to me. Adding its contents to the tree is much faster (0.002 sec).
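For reference, the read-then-add loop described above can be timed along these lines. It is a minimal sketch, not my actual loader: it assumes Python 3, one word per line in `*.txt` files (the real file format is not shown here), and `load_words` and its `add_word` callback are hypothetical names.

```python
import glob
import os
import tempfile
import time


def load_words(directory, add_word):
    """Read every *.txt file in `directory` and time reading and adding separately.

    Assumes one word per line. Returns (word count, read seconds, add seconds).
    """
    read_time = 0.0
    add_time = 0.0
    count = 0
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        start = time.perf_counter()
        with open(path, encoding="utf-8") as handle:  # closed on leaving the block
            lines = handle.read().splitlines()
        read_time += time.perf_counter() - start

        start = time.perf_counter()
        for line in lines:
            word = line.strip()
            if word:  # skip blank lines
                add_word(word)
                count += 1
        add_time += time.perf_counter() - start
    return count, read_time, add_time


if __name__ == "__main__":
    # Tiny demo on two throwaway files.
    with tempfile.TemporaryDirectory() as directory:
        for name, text in [("a.txt", "chat\nchien\n"), ("b.txt", "oiseau\n")]:
            with open(os.path.join(directory, name), "w", encoding="utf-8") as out:
                out.write(text)
        collected = []
        print(load_words(directory, collected.append))
```

Keeping the two timers separate shows whether the time goes into the file reads or into the tree insertions.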
In the end I wanted to build an object database, but I guess Python is not suited to that. Maybe I will go for MongoDB :(
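class Trie():
    """
    Class to store known entities / words / verbs...
    """
    longest_word = -1   # length of the longest word stored so far
    nb_entree = 0       # total number of entries stored

    def __init__(self):
        self.children = {}    # maps a letter to the child Trie node
        self.isWord = False   # True if the path to this node spells a word
        self.infos = []       # entries (type, definition, ...) for this word

    def add(self, orthographe, entree):
        """
        Store a string with the given type and definition in the Trie structure.
        """
        if len(orthographe) > Trie.longest_word:
            Trie.longest_word = len(orthographe)
        if len(orthographe) == 0:
            # End of the word: mark this node and attach the entry.
            self.isWord = True
            self.infos.append(entree)
            Trie.nb_entree += 1
            return True
        car = orthographe[0]
        if car not in self.children:
            self.children[car] = Trie()
        # Recurse on the rest of the word and propagate the result.
        return self.children[car].add(orthographe[1:], entree)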
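To get a feel for how many objects such a tree holds, a small experiment like the one below can help. It is a sketch under assumptions: the class is a condensed copy of the one above with an iterative `add()` and without the class-level counters, and `count_nodes` and `shallow_node_size` are hypothetical helpers. `sys.getsizeof` is shallow, so the per-node figure is only a rough lower bound that ignores the keys and the stored entries.

```python
import sys


class Trie(object):
    """Condensed copy of the class above, with an iterative add()."""

    def __init__(self):
        self.children = {}
        self.isWord = False
        self.infos = []

    def add(self, orthographe, entree):
        node = self
        for car in orthographe:
            if car not in node.children:
                node.children[car] = Trie()
            node = node.children[car]
        node.isWord = True
        node.infos.append(entree)


def count_nodes(root):
    """Count every node in the trie, root included, without recursion."""
    total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(node.children.values())
    return total


def shallow_node_size(node):
    """Rough lower bound, in bytes, for one node and its own containers."""
    size = sys.getsizeof(node)
    size += sys.getsizeof(node.children) + sys.getsizeof(node.infos)
    if hasattr(node, "__dict__"):
        size += sys.getsizeof(node.__dict__)  # per-instance attribute dict
    return size


trie = Trie()
for word in ["chat", "chats", "chien"]:
    trie.add(word, {"type": "noun"})

print(count_nodes(trie))        # 9: the root plus c, h, a, t, s, i, e, n
print(shallow_node_size(trie))  # bytes; varies with the Python version
```

Multiplying the node count of the real tree by the per-node figure gives a rough lower bound on the memory the unpickled structure needs, which can then be compared with the 180 MB file.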