8

In Python I have a file stream, and I want to copy part of it into a StringIO. I want this to be as fast as possible, with minimal copying.

But if I do:

data = file.read(SIZE)
stream = StringIO(data)

I think two copies are made here, no? One copy from the file into data, and another inside StringIO into its internal buffer. Can I avoid one of them? I don't need the temporary data, so one copy should be enough.

zaharpopov

4 Answers

8

In short: you can't avoid 2 copies using StringIO.

Some assumptions:

  • You're using cStringIO, otherwise it would be silly to optimize this much.
  • It's speed and not memory efficiency you're after. If not, see Jakob Bowyer's solution, or use a variant using file.read(SOME_BYTE_COUNT) if your file is binary.
  • You've already stated this in the comments, but for completeness: you want to actually edit the contents, not just view it.

Long answer: Since python strings are immutable and the StringIO buffer is not, a copy will have to be made sooner or later; otherwise you'd be altering an immutable object! For what you want to be possible, the StringIO object would need to have a dedicated method that read directly from a file object given as an argument. There is no such method.
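A minimal sketch of the point about immutability, assuming Python 3, where io.BytesIO stands in for cStringIO for byte data (that substitution is an assumption; the discussion here targets Python 2). Writing to the stream leaves the source object unchanged, so the stream must own a separate buffer by the time it is modified:

```python
import io

data = b"hello world"          # immutable, like the result of file.read(SIZE)
stream = io.BytesIO(data)      # stream initialised from the immutable object

stream.seek(0)
stream.write(b"J")             # modify the stream in place

unchanged = data               # the original object still holds the old bytes
modified = stream.getvalue()   # the stream's contents now differ from it
print(unchanged, modified)
```

Whether the copy happens at construction or at the first write is an implementation detail; either way it cannot be skipped once the stream is edited.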

Outside of StringIO, there are solutions that avoid the extra copy. Off the top of my head, this will read a file directly into a modifiable byte array, no extra copy:

import numpy as np
a = np.fromfile("filename.ext", dtype="uint8")

It may be cumbersome to work with, depending on the usage you intend, since it's an array of values from 0 to 255, not an array of characters. But it's functionally equivalent to a StringIO object, and using np.fromstring, np.tostring, np.tofile and slicing notation should get you where you want. You might also need np.insert, np.delete and np.append.

I'm sure there are other modules that will do similar things.

TIMEIT:

How much does all this really matter? Well, let's see. I've made a 100MB file, largefile.bin. Then I read in the file using both methods and change the first byte.

$ python -m timeit -s "import numpy as np" "a = np.fromfile('largefile.bin', 'uint8'); a[0] = 1"
10 loops, best of 3: 132 msec per loop
$ python -m timeit -s "from cStringIO import StringIO" "a = StringIO(); a.write(open('largefile.bin').read()); a.seek(0); a.write('1')"
10 loops, best of 3: 203 msec per loop

So in my case, using StringIO is 50% slower than using numpy.

Lastly, for comparison, editing the file directly:

$ python -m timeit "a = open('largefile.bin', 'r+b'); a.seek(0); a.write('1')"
10000 loops, best of 3: 29.5 usec per loop

So it's nearly 4500 times faster than the numpy version. Of course, this depends heavily on what you're going to do with the file, and altering the first byte is hardly representative. But with this method you do have a head start on the other two, and since most operating systems buffer disk access well, the speed may be very good too.
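As a rough sketch of editing a working copy in place rather than the original, assuming Python 3 and using hypothetical file names created in a temporary directory purely so the example is self-contained:

```python
import os
import shutil
import tempfile

# Hypothetical file names, created here only for illustration.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "largefile.bin")
working = os.path.join(workdir, "working.bin")

with open(original, "wb") as f:
    f.write(b"\x00" * 1024)

shutil.copyfile(original, working)   # leave the original untouched

with open(working, "r+b") as f:      # read/write mode, no truncation
    f.seek(0)
    f.write(b"1")                    # overwrite only the first byte
    f.seek(0)
    head = f.read(4)                 # read back the edited region

with open(original, "rb") as f:
    original_head = f.read(4)

shutil.rmtree(workdir)               # clean up the temporary files
```

The file object offers the same seek, read and write interface as StringIO, so code written against one should mostly work against the other.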

(If you're not allowed to edit the file and so want to avoid the cost of making a working copy, there are a couple of possible ways to increase the speed. If you can choose the filesystem, Btrfs has a copy-on-write file copy operation -- making the act of taking a copy of a file virtually instant. The same effect can be achieved using an LVM snapshot of any filesystem.)

Lauritz V. Thaulow
  • Is there a way to do this without numpy, i.e. with the stdlib only? Maybe bytearray would have the same effect? – zaharpopov Nov 23 '11 at 13:08
  • Not that I know of, no. Bytearray seems not to accept file objects as an argument. – Lauritz V. Thaulow Nov 23 '11 at 13:25
  • that sounds like a shame, so the only way to read modifiable buffer from file fast is with numpy :( – zaharpopov Nov 23 '11 at 13:30
  • The fastest of all would of course be to directly edit the file (or a copy) using the file object methods `seek`, `read` and `write` -- the same interface as StringIO. The disk cache will help speed things up too. Is there some reason you want it in-memory? – Lauritz V. Thaulow Nov 23 '11 at 13:41
  • if there's a lot of processing to do on the data, I'm not sure it would be faster to keep it in file. Also, I don't want to modify contents of file itself (but eventually generate some reports to another file) – zaharpopov Nov 24 '11 at 16:18
6

No, there is not an extra copy made. The buffer used to store the data is the same. Both data and the internal attribute accessible using StringIO.getvalue() are different names for the same data.

Python 2.7 (r27:82500, Jul 30 2010, 07:39:35) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import StringIO
>>> data = open("/dev/zero").read(1024)
>>> hex(id(data))
'0xea516f0'
>>> stream = StringIO.StringIO(data)
>>> hex(id(stream.getvalue()))
'0xea516f0'

A quick skim through the source shows that cStringIO doesn't make a copy on construction either, but it does make a copy on calling cStringIO.getvalue(), so I can't repeat the above demonstration.

Michael Hoffman
  • Since the content of `data` is immutable and the content of `stream` is not, the extra copy is bound to be made as soon as the StringIO object is modified, if not before. The question remains. – Lauritz V. Thaulow Nov 23 '11 at 11:10
  • That's a different question. If you want to know how StringIO works, the best thing to do is read `StringIO.py`. – Michael Hoffman Nov 23 '11 at 11:21
  • @MichaelHoffman: thank you, but I'm specifically interested in the copy made when a modification is done. I know that StringIO does it; my question is how to avoid it. How do I read data directly from a file into a modifiable StringIO? – zaharpopov Nov 23 '11 at 12:27
2

Maybe what you're looking for is a buffer/memoryview:

>>> data = file.read(SIZE)
>>> buf = buffer(data, 0, len(data))

This way you can access a slice of the original data without copying it. However, you must be interested in accessing that data only in byte oriented format since that's what the buffer protocol provides.
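A small sketch of the same idea with memoryview, assuming Python 3 (where buffer no longer exists) and a bytearray as a stand-in for the data. A view over an immutable bytes object returned by file.read would be read-only; the bytearray is used here so the write through the view can be shown:

```python
data = bytearray(b"abcdefghij")   # stand-in for data read from a file
view = memoryview(data)[2:6]      # a window onto bytes 2..5, no copy made

view[0] = ord("X")                # writing through the view changes `data`
window = bytes(view)              # bytes() does copy; use it only when needed
view.release()                    # drop the view's hold on the buffer
```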

You can find more information in this related question.

Edit: In this blog post I found through reddit, some more information is given regarding the same problem:

>>> import os
>>> f = open(filename, 'rb')
>>> data = bytearray(os.path.getsize(filename))
>>> f.readinto(data)

According to the author no extra copy is created and data can be modified since bytearray is mutable.
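To copy only part of a stream, as in the question, the same approach can be sketched like this, assuming Python 3. SIZE, the offset and the sample file are hypothetical values chosen for illustration:

```python
import os
import tempfile

SIZE = 16  # hypothetical number of bytes to copy from the stream

# Create a small sample file so the sketch is self-contained.
fd, filename = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(bytes(range(64)))

data = bytearray(SIZE)        # preallocated, mutable buffer
with open(filename, "rb") as f:
    f.seek(8)                 # start of the part we want
    n = f.readinto(data)      # fills `data` in place, returns the byte count

data[0] = 0xFF                # the buffer can be edited directly
os.remove(filename)
```

The return value of readinto is worth checking, since it can be smaller than the buffer if the file ends early.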

jcollado
  • It depends on the object being accessed. In the [memoryview](http://docs.python.org/library/stdtypes.html#memoryview) documentation there's an example that changes a value in a `bytearray` object (without changing its size). However, in your example, `file.read` will return an immutable string, so you won't be able to do that on that object. – jcollado Nov 23 '11 at 13:37
  • I've just seen [this](http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews/) in reddit and it seems to solve the issue to get the data into a `bytearray` using `file.readinto`. – jcollado Nov 28 '11 at 18:26
0
stream = StringIO()
for line in file:
    stream.write(line)  # each line already ends with its newline
Jakob Bowyer