10

To give you context:

I have a large file f, several gigabytes in size. It contains consecutive pickles of different objects that were generated by running

for obj in objs: cPickle.dump(obj, f)

I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data is indeed newline-delimited one could use readlines, but I am not sure if that is true.

Another option that I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines() and loads(). But I fear that a pickled object may contain the "\n" character and that it will throw off this file-reading scheme. Is my fear unfounded?

One option is to pickle everything out as one huge list of objects, but that will require more memory than I can afford. The setup could be sped up by multi-threading, but I do not want to go there before I get the buffering working properly. What's the "best practice" for situations like this?

EDIT: I can also read raw bytes into a buffer and invoke loads on that, but I need to know how many bytes of that buffer were consumed by loads so that I can throw the consumed head away.
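
A rough sketch of what I mean, with a made-up file name and chunk size: wrap the raw chunk in a StringIO buffer and check tell() after each load to see how many bytes were consumed.

import cPickle
import cStringIO

CHUNK = 1024 * 1024  # hypothetical 1 MB read size

with open('objects.pkl', 'rb') as f:  # file name is illustrative
    chunk = f.read(CHUNK)
    buf = cStringIO.StringIO(chunk)
    obj = cPickle.load(buf)       # consumes exactly one pickle from the buffer
    consumed = buf.tell()         # how many bytes of the chunk that load used
    leftover = chunk[consumed:]   # unconsumed tail, to prepend to the next read
    # Note: if the chunk ends mid-pickle, load raises an error and the chunk
    # needs to be extended with more data before retrying.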

san
    There should already be buffering happening behind the scenes. Also threads won't help. – Winston Ewert Apr 01 '11 at 03:58
  • If you have control over the creation of the pickle file, you could probably make it a lot smaller by using `pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)` (or the equivalent `pickle.dump(obj, f, -1)`) which is a binary protocol and much more compact than the default ASCII one you're getting. Having a much smaller file might mitigate concerns about buffering. In fact it would likely mean that the "looking for lines ending in `'.\n'`" trick in @Kirk Strauser's [answer](http://stackoverflow.com/a/5507750/355230) wouldn't work. – martineau Jan 11 '15 at 16:26
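
To illustrate the protocol suggestion in the comment above, a minimal sketch of writing with the binary protocol (the file name and the stand-in `objs` are illustrative):

import cPickle

objs = [{'id': i} for i in range(1000)]  # stand-in for the real objects

# cPickle.HIGHEST_PROTOCOL (or protocol -1) selects the most compact binary
# protocol instead of the default ASCII protocol 0.
with open('objects.pkl', 'wb') as f:
    for obj in objs:
        cPickle.dump(obj, f, cPickle.HIGHEST_PROTOCOL)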

4 Answers

7

You don't need to do anything, I think.

>>> import pickle
>>> import StringIO
>>> s = StringIO.StringIO(pickle.dumps('apples') + pickle.dumps('bananas'))
>>> pickle.load(s)
'apples'
>>> pickle.load(s)
'bananas'
>>> pickle.load(s)

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    pickle.load(s)
  File "C:\Python26\lib\pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "C:\Python26\lib\pickle.py", line 858, in load
    dispatch[key](self)
  File "C:\Python26\lib\pickle.py", line 880, in load_eof
    raise EOFError
EOFError
>>> 
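
So reading them back is just a loop that calls load() until EOFError; a minimal sketch (file name assumed):

import cPickle

with open('objects.pkl', 'rb') as f:   # file name is illustrative
    while True:
        try:
            obj = cPickle.load(f)      # one pickle per call, straight from the stream
        except EOFError:
            break
        # ... process obj ...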
SingleNegationElimination
5

file.readlines() returns a list of the entire contents of the file. You'll want to read a few lines at a time. I think this naive code should unpickle your data:

import pickle
infile = open('/tmp/pickle', 'rb')
buf = []
while True:
    line = infile.readline()
    if not line:
        break
    buf.append(line)
    # Assumes a line ending in '.\n' (the '.' STOP opcode plus a newline)
    # marks the end of one protocol-0 pickle; join the collected lines and decode.
    if line.endswith('.\n'):
        print 'Decoding', buf
        print pickle.loads(''.join(buf))
        buf = []

If you have any control over the program that generates the pickles, I'd pick one of:

  1. Use the shelve module.
  2. Print the length (in bytes) of each pickle before writing it to the file so that you know exactly how many bytes to read in each time (see the sketch after this list).
  3. Same as above, but write the list of integers to a separate file so that you can use those values as an index into the file holding the pickles.
  4. Pickle a list of K objects at a time. Write the length of that pickle in bytes. Write the pickle. Repeat.
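
A minimal sketch of the length-prefix framing in options 2 and 4 (function names are just illustrative; struct provides a fixed-size 4-byte header):

import cPickle
import struct

def write_framed(objs, f):
    # Prefix each pickle with its length as a 4-byte big-endian unsigned int.
    for obj in objs:
        data = cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL)
        f.write(struct.pack('>I', len(data)))
        f.write(data)

def read_framed(f):
    # Read the 4-byte header, then exactly that many bytes -- no delimiter
    # guessing, and you can read several frames into a buffer if you like.
    while True:
        header = f.read(4)
        if len(header) < 4:
            break
        (size,) = struct.unpack('>I', header)
        yield cPickle.loads(f.read(size))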

By the way, I suspect that the file's built-in buffering should get you 99% of the performance gains you're looking for.

If you're convinced that I/O is blocking you, have you thought about trying mmap() and letting the OS handle packing in blocks at a time?

#!/usr/bin/env python

import mmap
import cPickle

fname = '/tmp/pickle'
infile = open(fname, 'rb')
# Memory-map the whole file read-only so the OS pages it in as needed.
m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
start = 0
while True:
    # Find the next '.\n' terminator; find() returns -1 when there is none,
    # so end == 1 after adding 2 means we have run out of pickles.
    end = m.find('.\n', start + 1) + 2
    if end == 1:
        break
    print cPickle.loads(m[start:end])
    start = end
Kirk Strauser
  • This is very similar to the solution I had, except that I considered using readlines with a prespecified limit on the number of bytes to read in. Am I guaranteed that the pickles will never have a newline inside? That is the critical bit that I want to know; otherwise ours is a buggy solution. Did I read you correctly that individual pickles end with ".\n", in other words a dot followed by a newline? Thanks for the other suggestions. Right now I keep invoking cPickle.load() on a buffered file object, and can read some 200 objects/sec, each around 800 bytes on average; that's still slow. – san Apr 01 '11 at 00:21
  • I'm no expert, but looking at a database table I have with a few thousand pickles in it, all of them end with '.\n'. Look at `pickle.dumps('\n')`; it seems to escape '\n', although I don't know if that's guaranteed. What do you think about #4, where you pickle many items in a list then unpack them from a single file.read()? – Kirk Strauser Apr 01 '11 at 00:38
  • Yes #4 is nice. Only drawback is that the consumer will be constrained to that fixed 'buffer'. But if what you say about ".\n" is correct the problem is already solved :) – san Apr 01 '11 at 00:48
2

You might want to look at the shelve module. It uses a database module such as dbm to create an on-disk dictionary of objects. The objects themselves are still serialized using pickle. That way you could read sets of objects instead of one big pickle at a time.
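
A minimal sketch of that approach (the key scheme, file name, and stand-in data are just illustrative):

import shelve

objs = [{'id': i} for i in range(1000)]   # stand-in for the real objects

# Write: store each object under its own key instead of concatenated pickles.
db = shelve.open('objects.shelf')
for i, obj in enumerate(objs):
    db[str(i)] = obj                      # shelve keys must be strings
db.close()

# Read: pull objects back one key at a time (order is not guaranteed).
db = shelve.open('objects.shelf')
for key in db.keys():
    obj = db[key]
    # ... process obj ...
db.close()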

Kamil Kisiel
  • Thanks for your reply. I cannot see how this will help with more efficient buffering, though. What I am looking for is a way to read in a large chunk of bytes and unpickle some `K` objects at a time. The idea is to alleviate the I/O bottleneck. I believe what you are suggesting is that I try to load a number of objects by supplying many keys. The module then looks up the values corresponding to those keys. Isn't that a lot of extra work, and hence a step backwards, given that I do not care about the order of the objects, just that I want to go over all of them? But I could have got you wrong. – san Mar 31 '11 at 23:45
2

If you want to add buffering to any file, open it via io.open(). Here is an example which will read from the underlying stream in 128K chunks. Each call to cPickle.load() will be fulfilled from the internal buffer until it is exhausted, then another chunk will be read from the underlying file:

import cPickle
import io

buf = io.open('objects.pkl', 'rb', buffering=(128 * 1024))
obj = cPickle.load(buf)
samplebias
  • That's exactly what I do; in fact I read in a gzip-compressed (compression level 2) file using a buffered interface, but I was expecting to grab as many pickled objects as I can, efficiently. So the feedback that I am getting is that without threading this is mostly how fast this will ever be. – san Apr 01 '11 at 00:31