
I have a rather large list: 19 million items in memory that I am trying to save to disk (Windows 10 x64 with plenty of space).

pickle.dump(list, open('list.p'.format(file), 'wb')) 

Background: The original data was read in from a CSV (2 columns, the same 19 million rows) and converted into a list of tuples.

The original CSV file was 740 MB. The file "list.p" shows up in my directory at 2.5 GB, but the Python process does not budge (I was debugging and stepping through each line), and memory utilization at last check was 19 GB and still increasing.

I am just interested if anyone can shed some light on this pickle process.

PS - I understand that pickle.HIGHEST_PROTOCOL is now protocol version 4, which was added in Python 3.4 (it adds support for very large objects).
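
For reference, a minimal sketch of what I think the call should look like with an explicit protocol (my_list here is just a placeholder name for the 19-million-item list of tuples):

import pickle

# sketch only: my_list stands in for the real list of tuples
with open('list.p', 'wb') as outf:
    pickle.dump(my_list, outf, protocol=pickle.HIGHEST_PROTOCOL)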

Pouya Barrach-Yousefi
  • `pickle` builds a program for a simple stack-based virtual machine that is able to reconstruct arbitrarily complex objects (objects containing objects, and so on); a short disassembly sketch follows these comments. There are special protections built into the pickler to guard against pathological structures such as circular lists (which, if not caught, would yield an infinite loop). It is possible that your `list` has a structure that the module doesn't have a guard against. Also, naming a variable `list` is a bad idea because it shadows the built-in of the same name. I'd need code and data to guess further. – msw Mar 15 '16 at 23:41
  • Assuming your machine has less than 19GB real memory, everything will slow to a crawl when you've got a program frantically swapping blocks in and out in order to complete a computation. But, absent some pathological case, a simple dump should never be generating usages that high. – msw Mar 15 '16 at 23:44
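
As a small illustration of the stack-machine point in the first comment above, the standard pickletools module can disassemble a pickle and print the opcode "program" the pickler emits; the sample list below is just a tiny stand-in for the real data:

import pickle
import pickletools

sample = [(1, 2), (3, 6)]   # tiny stand-in for the real list of tuples
payload = pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL)
pickletools.dis(payload)    # prints the stack-machine opcodes that rebuild the list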

1 Answer


I love the concept of pickle but find it makes for a bad, opaque, and fragile backing store. The data are already in CSV, and I don't see any obvious reason not to leave them in that form.

Testing under Python 3.4 on Linux yielded these timeit results:

Create dummy two column CSV 19M lines: 17.6s
Read CSV file back in to a persistent list: 8.62s
Pickle dump of list of lists: 21.0s
Pickle load of dump into list of lists: 7.00s

As the mantra goes: until you measure it, your intuitions are useless. Sure, loading the pickle is slightly faster (7.00 s vs. 8.62 s), but not dramatically so. The pickle file is nearly twice the size of the CSV and can only be read by unpickling it. By contrast, every tool can read the CSV, Python included. I just don't see the advantage.

For reference, here is my IPython 3.4 test code:

import csv
import pickle

def create_csv(path):
    # write a dummy two-column CSV with 19M rows
    with open(path, 'w') as outf:
        csvw = csv.writer(outf)
        for i in range(19000000):
            csvw.writerow((i, i*2))

def read_csv(path):
    # read the CSV back into a list of lists (each row is a list of strings)
    table = []
    with open(path) as inf:
        csvr = csv.reader(inf)
        for row in csvr:
            table.append(row)
    return table

%timeit create_csv('data.csv')
%timeit read_csv('data.csv')
table = read_csv('data.csv')  # keep one copy around so the pickle steps below have data
%timeit pickle.dump(table, open('data.pickle', 'wb'))
%timeit new_table = pickle.load(open('data.pickle', 'rb'))
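
As a quick, optional check of the size claim above, the two files produced by this code can be compared directly:

import os
print(os.path.getsize('data.csv'), os.path.getsize('data.pickle'))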

In case you are unfamiliar, IPython is Python in a nicer shell. I deliberately didn't look at memory utilization because the thrust of this answer (why use pickle at all?) renders memory use irrelevant.

msw
  • Thank you for the thorough response. A couple questions: 1) Any idea what happened to the out of control memory and never returning from the write? 2) If I save my current modified series as a csv, I get rows of MyClass([list1item1, list1item2], [list2Item1]) What would be the best / quickest way to reload this in the future? – Pouya Barrach-Yousefi Mar 15 '16 at 22:08
  • @PouyaYousefi I answered 1) in the comments on your post. For 2) I strongly recommend you open a new question with an [MCVE](http://stackoverflow.com/help/mcve) showing what you've got in code and data and how it's screwing up. Feel free to tag me in these comments if you get a new one up so that I may better notice it. – msw Mar 15 '16 at 23:53