My application hangs while serializing a list of dictionaries (CSV data) with pickle. Under the regular Python interpreter there are no issues. I am on Python 2.7, PyPy 2.6.0 for Win32.

Here is the output when I Ctrl+C the application:

Traceback (most recent call last):
  File "<builtin>/app_main.py", line 75, in run_toplevel
  File ".\Da-Lite\dalite_build_script.py", line 167, in <module>
    pickle.dump(data_sheets, fo)
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 1413, in dump
    Pickler(file, protocol).dump(obj)
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 224, in dump
    self.save(obj)
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 653, in save_dict
    self._batch_setitems(obj.iteritems())
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 667, in _batch_setitems
    save(v)
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 653, in save_dict
    self._batch_setitems(obj.iteritems())
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 667, in _batch_setitems
    save(v)
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 615, in _batch_appends
    save(x)
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 653, in save_dict
    self._batch_setitems(obj.iteritems())
  File "H:\Developer\Python\pypy-2.6.0-win32\lib-python\2.7\pickle.py", line 665, in _batch_setitems
    for k, v in items:
KeyboardInterrupt

Using pickle is not essential to the program, but if there is a relatively simple way to overcome this problem, it would make my life easier.
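
For reference, here is a stripped-down sketch of what the script does (the file names and exact nesting below are illustrative, but the final call is the one in the traceback):

import csv
import pickle

data_sheets = {}
for path in ['sheet1.csv', 'sheet2.csv']:        # illustrative file names
    with open(path, 'rb') as f:
        rows = list(csv.DictReader(f))           # each CSV row becomes a dict of column -> value
    data_sheets[path] = {'rows': rows}           # nested dicts/lists, matching the traceback

with open('data_sheets.pik', 'wb') as fo:
    pickle.dump(data_sheets, fo)                 # this is the call that hangs under PyPy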

Peter

2 Answers


The answer is simple here: pickle on PyPy is slower, because it's implemented in pure Python, as opposed to C in CPython.
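
One thing that may help regardless is to pickle with a binary protocol rather than the default protocol 0, since protocol 0 does much more per-object text encoding. A minimal sketch (the file name here is illustrative):

import pickle

with open('data_sheets.pik', 'wb') as fo:                  # binary mode is required for protocol >= 1
    pickle.dump(data_sheets, fo, pickle.HIGHEST_PROTOCOL)  # protocol 2 on Python 2.7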

fijal
  • I get this... but should it be *a lot* slower? Something that took less than a second is now taking > 10 minutes. Why implement pickle at all in PyPy? – Peter Aug 06 '15 at 16:45
  • it definitely should not be 1s -> 10min (more like say 2-5x slower). Can you share the example? – fijal Aug 10 '15 at 22:01

If you have a list of dictionaries, a very straightforward thing to do would be to break up the list into several parts, and dump each portion of the list to a different file. Something like this:

>>> d1 = dict(zip(range(10),range(10)))
>>> d2 = dict(zip(range(10,20),range(10,20)))
>>> d3 = dict(zip(range(20,30),range(20,30)))
>>> d4 = dict(zip(range(30,40),range(30,40)))
>>> x = [d1,d2,d3,d4]
>>> fnames = ['a.pik', 'b.pik', 'c.pik', 'd.pik']
>>> 
>>> import pathos
>>> p = pathos.pools.ProcessPool()
>>> 
>>> def dump(data, fname):
...   import dill
...   with open(fname, 'wb') as f:  # 'wb': binary mode for pickled data
...     dill.dump(data, f)
...   return
... 
>>> r = p.uimap(dump, x, fnames)
>>> # no need to do this, but just FYI, it returns nothing
>>> list(r)
[None, None, None, None]
>>> 

One thing to note: I'm using multiprocess, a fork of multiprocessing that is used by pathos. It provides a multiprocessing map that can take multiple arguments, reduces some of the overhead of starting a map, and has better serialization capabilities than pickle.
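
As a quick illustration of the serialization point, dill handles objects that the stdlib pickle refuses, e.g. a lambda (a small sketch):

>>> import dill
>>> f = dill.loads(dill.dumps(lambda x: x + 1))   # stdlib pickle raises PicklingError here
>>> f(2)
3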

I'm using uimap because I don't care about maintaining the order of the return values (they are all None). If your dictionaries are big, you may want to use processes… but depending on their size, you might try a thread pool instead, or even a plain serial imap (as in itertools). pathos provides all of these behind a nice uniform API, so you can switch easily and optimize for execution speed.

>>> pathos.pools.ProcessPool 
<class 'pathos.multiprocessing.ProcessPool'>
>>> pathos.pools.ThreadPool 
<class 'pathos.threading.ThreadPool'>
>>> pathos.pools.SerialPool 
<class 'pathos.serial.SerialPool'>

NOTE: I'm the author of pathos. I know it works in standard Python, but I can't confirm that it works in PyPy at the moment. I have had people try it, and I have made patches to support PyPy, but I don't test on PyPy… so you'd have to try it and find out. If pathos doesn't work on PyPy, then you'd have to modify the dump function to take only one argument, or make sure you are using itertools.imap. Regardless, the main point is the idea of breaking up the list into several chunks and then serializing the chunks on different processes/threads/whatever.
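
For completeness, here is the same chunking idea with no dependencies at all, using plain pickle in a serial loop (a sketch; the chunk size and file name prefix are arbitrary):

import pickle

def dump_chunks(data, chunk_size=1000, prefix='chunk'):
    # split the list of dicts into pieces and pickle each piece to its own file
    for i in range(0, len(data), chunk_size):
        fname = '%s_%03d.pik' % (prefix, i // chunk_size)
        with open(fname, 'wb') as f:
            pickle.dump(data[i:i + chunk_size], f, pickle.HIGHEST_PROTOCOL)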

Mike McKerns