8

Suppose I have a StringIO object (from cStringIO). I want to read from it until a given character/byte is encountered, say 'Z', so:

stringio = StringIO('ABCZ123')
buf = read_until(stringio, 'Z')  # buf is now 'ABCZ'
# stringio.tell() is now 4, pointing after 'Z'

What is the fastest way to do this in Python? Thank you.

zaharpopov

3 Answers

7

I am very disappointed that this question got only one answer on Stack Overflow, because it is an interesting and relevant question. Anyway, since only ovgolovin gave a solution and I thought it might be slow, I came up with a faster one:

def foo(stringio):
    datalist = []
    while True:
        chunk = stringio.read(256)
        i = chunk.find('Z')
        if i == -1:
            datalist.append(chunk)
        else:
            datalist.append(chunk[:i+1])
            break
        if len(chunk) < 256:
            break
    return ''.join(datalist)

This reads the stream in chunks (the end character may not be in the first chunk). It is very fast because no Python function is called for each character; instead it makes maximal use of Python's built-in functions written in C.

This runs about 60x faster than ovgolovin's solution; I checked it with timeit.
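The question also expects the stream position to end up right after 'Z', which the function above does not guarantee because it reads whole chunks. A minimal sketch of one way to handle that, assuming Python 3 with `io.StringIO` or `io.BytesIO` and a single-character delimiter; the name `read_until` comes from the question, while the `chunk_size` parameter is an assumption added for illustration:

```python
import io


def read_until(stream, delimiter, chunk_size=256):
    """Read from `stream` up to and including `delimiter`.

    Assumes a seekable in-memory stream and a one-character (or one-byte)
    delimiter. If the delimiter never appears, everything is returned.
    """
    empty = delimiter[:0]  # '' for str, b'' for bytes
    parts = []
    while True:
        start = stream.tell()  # position before this chunk
        chunk = stream.read(chunk_size)
        i = chunk.find(delimiter)
        if i != -1:
            end = i + len(delimiter)
            parts.append(chunk[:end])
            # Move back so the next read starts right after the delimiter.
            stream.seek(start + end)
            break
        parts.append(chunk)
        if len(chunk) < chunk_size:  # end of stream, delimiter not found
            break
    return empty.join(parts)


if __name__ == '__main__':
    s = io.StringIO('ABCZ123')
    print(read_until(s, 'Z'), s.tell())
```

The absolute `seek(start + end)` is used instead of a relative seek because text streams in Python 3 do not allow nonzero relative seeks.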

zaharpopov
  • Very good solution! It addresses Python's heavy function-call overhead. The only downside is that you keep a redundant `datalist` object in memory. It's possible to rewrite this code as a generator instead of a function (`join` accepts iterators), so there will be no temporary redundant objects in memory. – ovgolovin Nov 27 '11 at 10:52
  • But the generator version turns out to be a bit slower: http://ideone.com/dQGe5 (though if the string is big, about 1 million symbols, the generator version is a bit faster). – ovgolovin Nov 27 '11 at 10:58
  • By the way, why have you chosen `256` symbol chunks? (why not `512` or `1024`?) – ovgolovin Nov 27 '11 at 10:59
  • And, I hope, the last point. It's not Pythonic to write `chunk.find('Z')`. It can be rewritten as `if 'Z' in chunk: ...` – ovgolovin Nov 27 '11 at 11:05
  • No, that wasn't the last point. `chunk[:i]` should be `chunk[:i+1]` (because we need to include `Z`). – ovgolovin Nov 27 '11 at 11:26
  • Also, if there is no `Z` in the string, `while True` will create an infinite loop without breaking out. – ovgolovin Nov 27 '11 at 11:41
  • You're missing a `stringio.seek` at the end to put the current position back to right after the `Z`. – Baffe Boyois Nov 27 '11 at 11:58
  • @ovgolovin: re 256, I think it should be the typical expected string length (or a bit longer); 256 is just arbitrary for now – zaharpopov Nov 27 '11 at 12:18
  • @ovgolovin: re [:i+1], thanks, you're right. Re the non-Pythonic chunk.find: I also need the index later, so I can't use `in` (I don't want to run the search twice). I also fixed the exit condition for when no 'Z' is found, good catch – zaharpopov Nov 27 '11 at 12:19
  • @BaffeBoyois: yes, you're right. I don't care about the stream position afterwards, but it should be easy to add (since I have the length of the chunk vs. i) – zaharpopov Nov 27 '11 at 12:20
  • It may be more Pythonic to use `for chunk in iter(lambda: stringio_loc.read(256),''):` instead of `while True: chunk = stringio.read(256)`. It also addresses the problem of breaking out of loop when the end is reached (the second argument of `iter` is responsible for that: when the stringio is exhausted it begins returning an empty string). – ovgolovin Nov 27 '11 at 12:29
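The `iter(callable, sentinel)` loop suggested in the last comment could look roughly like this. It is only a sketch, assuming Python 3 and a text stream (so the end-of-stream sentinel is `''`); the function name `read_until_iter` and its parameters are hypothetical, and `functools.partial` stands in for the lambda:

```python
import io
from functools import partial


def read_until_iter(stream, delimiter='Z', chunk_size=256):
    """Chunked read up to and including `delimiter`, for text streams."""
    parts = []
    # iter() keeps calling stream.read(chunk_size) and stops as soon as
    # it returns '' (the stream is exhausted), so no explicit length
    # check is needed to leave the loop.
    for chunk in iter(partial(stream.read, chunk_size), ''):
        i = chunk.find(delimiter)
        if i != -1:
            parts.append(chunk[:i + 1])
            break
        parts.append(chunk)
    return ''.join(parts)


if __name__ == '__main__':
    print(read_until_iter(io.StringIO('ABCZ123')))
```

Like the answer's function, this sketch does not restore the stream position after the delimiter.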
2
#!/usr/bin/env python3
import io


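def iterate_stream(stream, delimiter, max_read_size=1024):
    """Yield `delimiter`-terminated strings or bytes read from `stream`.

    Note: a multi-character delimiter split across two reads is not detected.
    """
    empty = '' if isinstance(delimiter, str) else b''
    chunks = []
    while True:
        d = stream.read(max_read_size)
        if not d:
            break
        while d:
            i = d.find(delimiter)
            if i < 0:
                # No delimiter in the rest of this read; keep it for later.
                chunks.append(d)
                break
            end = i + len(delimiter)
            chunks.append(d[:end])
            d = d[end:]
            yield empty.join(chunks)
            chunks = []
    # Emit whatever follows the last delimiter, if anything.
    s = empty.join(chunks)
    if s:
        yield s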


if __name__ == '__main__':
    print(next(iterate_stream(io.StringIO('ABCZ123'), 'Z')))
    print(next(iterate_stream(io.BytesIO(b'ABCZ123'), b'Z')))
pasztorpisti
2
i = iter(lambda: stringio.read(1),'Z')
buf = ''.join(i) + 'Z'

Here iter is used in this mode: iter(callable, sentinel) -> iterator.
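To make the two-argument form concrete, here is a small sketch, assuming Python 3 and `io.StringIO` rather than the Python 2 `StringIO` module used in this answer. The callable is invoked repeatedly, and iteration stops as soon as it returns a value equal to the sentinel, which is itself not yielded:

```python
import io

stream = io.StringIO('ABCZ123')

# stream.read(1) is called once per item; the iterator stops when the
# call returns 'Z', and 'Z' itself is not part of the result.
chars = list(iter(lambda: stream.read(1), 'Z'))

print(chars)          # ['A', 'B', 'C']
print(stream.tell())  # 4, because 'Z' was consumed from the stream
```

One caveat: if the stream contains no 'Z' at all, `read(1)` keeps returning an empty string at the end of the stream, which never equals the sentinel, so the iterator never terminates.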

''.join(...) is quite efficient. The final concatenation in ''.join(i) + 'Z' is less so, because it copies the whole string again. That can be avoided by appending 'Z' to the iterator instead:

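import StringIO  # Python 2 module (the question uses cStringIO)
from itertools import chain, repeat

stringio = StringIO.StringIO('ABCZ123')
i = iter(lambda: stringio.read(1), 'Z')
i = chain(i, repeat('Z', 1))  # append a single 'Z' after the characters read
buf = ''.join(i)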

Another way is to use a generator:

def take_until_included(stringio):
    while True:
        s = stringio.read(1)
        if not s:  # end of stream reached without finding 'Z'
            return
        yield s
        if s == 'Z':
            return

i = take_until_included(stringio)
buf = ''.join(i)

I ran some efficiency tests. The performance of the techniques described here is pretty much the same:

http://ideone.com/dQGe5
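For anyone who wants to repeat such a measurement locally, a rough timing harness might look like the sketch below. It assumes Python 3 and `io.StringIO`; the test input `DATA`, the helper names, and the repetition count are all made up for illustration, and it compares the per-character approach from this answer with a chunked read like the one in the other answer. The numbers will vary by machine and input size.

```python
import io
import timeit

# Hypothetical test input: a long run of characters before the delimiter.
DATA = 'A' * 10000 + 'Z' + 'B' * 100


def per_char(stream):
    """One read(1) call per character, as in the iter-based version."""
    return ''.join(iter(lambda: stream.read(1), 'Z')) + 'Z'


def chunked(stream, chunk_size=256):
    """Read fixed-size chunks and search each one with str.find."""
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        i = chunk.find('Z')
        if i != -1:
            parts.append(chunk[:i + 1])
            break
        parts.append(chunk)
        if len(chunk) < chunk_size:
            break
    return ''.join(parts)


def measure(func, number=20):
    """Total seconds for `number` runs, each on a fresh stream."""
    return timeit.timeit(lambda: func(io.StringIO(DATA)), number=number)


if __name__ == '__main__':
    print('per_char:', measure(per_char))
    print('chunked: ', measure(chunked))
```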

ovgolovin