Well, if you cannot afford to read the whole file into memory, you can accomplish this by iterating through the bytes. I've used a deque as an auxiliary data structure, taking advantage of the maxlen parameter to scan every consecutive pair of bytes. To let me use a for-loop instead of an error-prone while-loop, I use the two-argument form of iter to iterate over the file byte by byte, i.e. iter(callable, sentinel), which keeps calling the callable until it returns the sentinel.
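Just to illustrate those two pieces in isolation (the names here are throwaway, purely for demonstration):
>>> from collections import deque
>>> window = deque(maxlen=2)              # keeps only the two most recent items
>>> for item in [1, 2, 3, 4]:
...     window.append(item)
...
>>> window
deque([3, 4], maxlen=2)
>>> values = iter([10, 20, 0, 30])
>>> list(iter(values.__next__, 0))        # call values.__next__() until it returns the sentinel 0
[10, 20]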
First, let's build a test case:
>>> import io, functools
>>> import random
>>> some_bytes = bytearray([random.randint(0, 255) for _ in range(12)] + [171, 195] + [88, 42, 88, 42, 88, 42] + [171, 196]+[200, 211, 141])
>>> some_bytes
bytearray(b'\x80\xc4\x8b\x86i\x88\xba\x8a\x8b\x07\x9en\xab\xc3X*X*X*\xab\xc4\xc8\xd3\x8d')
>>>
And now, some preliminaries:
>>> from collections import deque
>>> start = deque([b'\xab', b'\xc3'])
>>> stop = deque([b'\xab', b'\xc4'])
>>> current = deque(maxlen=2)
>>> target = []
>>> inside = False
Let's pretend we are reading from a file:
>>> f = io.BytesIO(some_bytes)
Now, create the handy one-byte-at-a-time reader (the callable we'll pass to iter):
>>> read_byte = functools.partial(f.read, 1)
And now we can loop much more easily:
>>> for b in iter(read_byte, b''):
...     current.append(b)
...     if not inside and current == start:
...         inside = True
...         continue
...     if inside and current == stop:
...         break
...     if inside:
...         target.append(b)
...
>>> target
[b'X', b'*', b'X', b'*', b'X', b'*', b'\xab']
>>>
You'll notice this leaves the first byte of the "stop" sequence in there. It is straightforward to clean up, though. Below is a quick cleanup sketch, followed by a more fleshed-out example where there are several "runs" of bytes between the delimiters:
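Continuing from the target list above (this is just one way to do it), pop the stray byte and join what's left into a single bytes object:
>>> target.pop()
b'\xab'
>>> b''.join(target)
b'X*X*X*'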
>>> some_bytes = some_bytes * 3
>>> start = deque([b'\xab', b'\xc3'])
>>> stop = deque([b'\xab', b'\xc4'])
>>> current = deque(maxlen=2)
>>> targets = []
>>> target = []
>>> inside = False
>>> f = io.BytesIO(some_bytes)
>>> read_byte = functools.partial(f.read, 1)
>>> for b in iter(read_byte, b''):
...     current.append(b)
...     if not inside and current == start:
...         inside = True
...         continue
...     if inside and current == stop:
...         inside = False
...         target.pop()
...         targets.append(target)
...         target = []
...     if inside:
...         target.append(b)
...
b'\xab'
b'\xab'
b'\xab'
>>> targets
[[b'X', b'*', b'X', b'*', b'X', b'*'], [b'X', b'*', b'X', b'*', b'X', b'*'], [b'X', b'*', b'X', b'*', b'X', b'*']]
>>>
(The three b'\xab' lines in the transcript above are just the values returned by target.pop(), echoed by the interactive interpreter.)
This approach will be slower than reading the file into memory and using re, but it will be memory efficient. There might be some edge cases to take care of that I haven't thought of, but I think it should be straightforward to extend the above approach. Also, if there is a "start" byte sequence with no corresponding "stop", the target list will keep growing until the file is exhausted.
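For comparison, if the data does fit comfortably in memory, the re route might look something like this (a sketch using the same delimiters; re.DOTALL makes . match every byte value, and *? keeps the match non-greedy):
>>> import re
>>> pattern = re.compile(b'\xab\xc3(.*?)\xab\xc4', re.DOTALL)
>>> pattern.findall(bytes(some_bytes))
[b'X*X*X*', b'X*X*X*', b'X*X*X*']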
Finally, perhaps the best way is to read the file in manageable chunks and process each chunk with the same per-byte logic as above. This combines space and time efficiency. In pseudo-pseudo code:
chunksize = 1024
start = deque([b'\xab', b'\xc3'])
stop = deque([b'\xab', b'\xc4'])
current = deque(maxlen=2)
targets = []
target = []
inside = False
read_chunk = functools.partial(f.read, chunksize)
for bytes_chunk in iter(read_chunk, b''):
    for b in bytes_chunk:
        < same logic as above >
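To make that concrete, here is one possible runnable version (a sketch; the extract_runs name and its defaults are just for illustration, not a finished API). One caveat the chunked version has to handle: iterating over a bytes object in Python 3 yields integers, not length-1 bytes, so the sliding window below compares tuples of ints instead of deques of single-byte bytes objects:

import io
from collections import deque
from functools import partial

def extract_runs(f, start=b'\xab\xc3', stop=b'\xab\xc4', chunksize=1024):
    """Collect the byte runs found between start and stop markers,
    reading the file object f one chunk at a time."""
    start_pair = tuple(start)     # e.g. (0xab, 0xc3); ints, since iterating
    stop_pair = tuple(stop)       # over bytes gives ints in Python 3
    current = deque(maxlen=2)     # sliding window over the last two bytes
    targets, target = [], bytearray()
    inside = False
    for chunk in iter(partial(f.read, chunksize), b''):
        for b in chunk:           # b is an int in range(256)
            current.append(b)
            if not inside and tuple(current) == start_pair:
                inside = True
                continue
            if inside and tuple(current) == stop_pair:
                inside = False
                target.pop()      # drop the first byte of the stop marker
                targets.append(bytes(target))
                target = bytearray()
                continue
            if inside:
                target.append(b)
    return targets

Run against the test data from above, it produces the same three runs:

>>> extract_runs(io.BytesIO(some_bytes))
[b'X*X*X*', b'X*X*X*', b'X*X*X*']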