
I have a 500 MB file, and I store each line of that file in a dictionary, set up like this:

file = "my_file.csv"
delimiter = ','
store_dict = {}  # holds every line, keyed by its first four fields

with open(file) as f:
    for l in f:
        line = l.split(delimiter)
        hash_key = delimiter.join(line[:4])
        store_line = delimiter.join(line[4:])
        store_dict[hash_key] = store_line

To check the memory usage, I watched my program in htop, first with the code above, then after switching the last line to

print(hash_key + ":" + store_line) 

And that took < 100MB of memory.

The size of my store_dict is approximately 1.5 GB in memory. I have checked for memory leaks and can't find any. Removing the line `store_dict[hash_key] = store_line` results in the program taking < 100 MB of memory. Why does the dictionary take up so much memory? Is there any way to store the lines in a dictionary without it taking up so much memory?
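
As a rough cross-check that does not rely on htop, here is a minimal sketch (not part of my original program) using the standard tracemalloc module; it reports the Python-level allocations made while the dictionary is built and assumes the same my_file.csv layout as above:

import tracemalloc

delimiter = ','
store_dict = {}

tracemalloc.start()  # start tracing Python memory allocations

with open("my_file.csv") as f:
    for l in f:
        line = l.split(delimiter)
        store_dict[delimiter.join(line[:4])] = delimiter.join(line[4:])

# current = memory still held, peak = high-water mark during the loop
current, peak = tracemalloc.get_traced_memory()
print("current: %.1f MiB, peak: %.1f MiB" % (current / 2**20, peak / 2**20))
tracemalloc.stop()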

jo2083248
  • "*the size of my store_dict is approximately 1.5GB in memory*" - How did you come to know that? What tool did you use to measure it? – Robᵩ Feb 28 '18 at 17:09
  • @Robᵩ I tracked the total memory of my Python program, changing `store_dict[hash_key] = store_line` to `print(hash_key + ":" + store_line)` – jo2083248 Feb 28 '18 at 17:13
  • @Robᵩ Using the first line, resulted in my program taking 1.5GB of memory, the second line taking < 100MB of memory. – jo2083248 Feb 28 '18 at 17:13
  • 2
    When you remove that line `store_dict[hash_key] = store_line`, you don't store anything in the dictionary, so naturally no memory is required for that. – mkrieger1 Feb 28 '18 at 17:14
  • https://stackoverflow.com/questions/23660717/python-dictionary-loaded-from-disk-takes-too-much-space-in-memory?rq=1 – David Zemens Feb 28 '18 at 17:15
  • @mkrieger1 Right, so I am wondering why a dictionary takes 1.5GB of memory to store a 500MB text file of lines (where the key is half of the line and the value it points to is the other half), so no additional data is being stored – jo2083248 Feb 28 '18 at 17:16
  • Also possible: "if you work with Unicode strings, each Unicode character use 2 or 4 bytes in memory. Whereas on your file, assuming UTF-8 encoding, most of the characters use only 1 byte" (see the sizing sketch after these comments) https://stackoverflow.com/questions/17313381/why-does-a-dictionary-use-so-much-ram-in-python?rq=1 – David Zemens Feb 28 '18 at 17:16
  • 1
    Possible duplicate of [Python: Reducing memory usage of dictionary](https://stackoverflow.com/questions/10264874/python-reducing-memory-usage-of-dictionary) – mkrieger1 Feb 28 '18 at 17:18
  • @DavidZemens I am decoding each line before I add it to the dictionary, `l.decode('utf-8')`. Also, for your other suggestion, I am making millions of calls to this dictionary every few minutes, so disk storage isn't really practical. (No, these calls don't have memory leaks; I removed all calls and simply filled the dictionary and nothing else, and the memory issue still persists.) – jo2083248 Feb 28 '18 at 17:19
  • BTW you can't have memory leaks in your Python code as you're not doing any memory management yourself. Don't worry about that. – mkrieger1 Feb 28 '18 at 17:20
  • @jo2083248 you might consider providing a [mcve] of your code. You should also edit your question to include details on *how* you determine the memory usage. Simply saying "I tracked memory usage" isn't adequate. – David Zemens Feb 28 '18 at 17:21
  • Basic question, are you using python 3.6? Under 3.6, "The dict type has been reimplemented to use a more compact representation." – jpp Feb 28 '18 at 17:24
  • @jpp I am using 3.6 – jo2083248 Feb 28 '18 at 17:26
  • @DavidZemens I expanded my example to include what you asked, thanks – jo2083248 Feb 28 '18 at 17:26
  • The next question I have: is there repeated data in your dictionary values? Reason I ask is if there is a lot of repeated data, you can factorise and potentially save a lot of space. – jpp Feb 28 '18 at 17:31
  • Unfortunately not, it's all unique customer data. Also, unfortunately, I believe a dictionary is my best choice of data structure because I need to perform some data validation, and a dictionary allows `O(1)` lookups once it is built. So I guess I am stuck between large memory consumption and longer runtime @jpp – jo2083248 Feb 28 '18 at 17:36
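
To illustrate the per-character sizes and the fixed per-string header touched on in the comments above, here is a small sketch (not from the original thread; exact byte counts vary between CPython versions) comparing a str's in-memory size with the number of bytes the same text occupies as UTF-8 on disk:

import sys

# A CPython 3 str stores 1, 2 or 4 bytes per character (whichever fits the
# widest character in the string) plus a fixed object header of several
# dozen bytes, so short keys and values pay a large relative overhead.
for sample in ("abcd" * 10, "é" * 40, "\u6c49" * 40):
    in_memory = sys.getsizeof(sample)      # header + character data
    as_utf8 = len(sample.encode("utf-8"))  # what the file on disk stores
    print(repr(sample[0]), "in memory:", in_memory, "bytes, as UTF-8:", as_utf8, "bytes")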

1 Answer


Even if the store_line strs each took up the same amount of memory as the corresponding piece of text in the file on disk (which they probably don't, especially if you are using Python 3, where strs default to Unicode), the dict necessarily takes up far more space than your file. The dict does not contain just the bare text; it contains a lot of Python objects.

Each dict key and value is a str, and each str carries not just the text itself but also its own length and a reference count used for garbage collection. The dict itself also needs to store metadata about its items, such as the hash of each key and pointers to the key and value objects.
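
As a rough illustration of that overhead (a sketch, not from the original answer; exact byte counts differ between CPython versions), sys.getsizeof can be used to add up the dict's own hash table plus every key and value object:

import sys

def rough_dict_memory(d):
    # getsizeof(d) covers only the hash table (cached hashes plus key and
    # value pointers); every key and value str is a separate object with
    # its own header, so they have to be counted on top of that.
    total = sys.getsizeof(d)
    for key, value in d.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total

store_dict = {"a,b,c,d": "e,f,g,h\n"}
print(rough_dict_memory(store_dict))  # far larger than the raw text itself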

If the file instead consisted of a few very long lines, you would expect the Python representation to have comparable memory consumption, since the fixed per-object overhead would then be negligible compared to the text itself. That is, provided the file uses an encoding that stores characters in roughly as many bytes as Python's internal string representation does...

jmd_dk