I want to find the points in a binary file that have specific bytes. For example say I want to check all instances in my file that start with the two bytes:

AB C3

And end with the two bytes:

AB C4

Right now I am doing

    while True:
        byte = file.read(1)
        if not byte:
            break
        if ord(byte) == 171:

But then how will I continue the loop so that once I find the first AB- I will see that consecutively the next byte is C3. And then once I find C3 how will I read in bytes to loop through until the sequence AB C4 (if it exists) without messing up my overall loop structure.

I am running into difficulty because I am not sure how to approach python's read and seek functions. Should I keep a pointer to seek back to when I find the sequences? Is there a simple way to do what I am trying to do in python that I am just unaware of?

Thanks.

J. Doe

2 Answers

Presuming you can read the entire file into memory:

import re
import operator

with open(filename, 'rb') as file:
    data = file.read()

matches = [(m.start(), m.end())
           for m in re.finditer(b'\xab\xc3.*?\xab\xc4', data, re.DOTALL)]

(The pattern is non-greedy, so each match stops at the first AB C4, and re.DOTALL lets . match any byte value. The file contents are bound to data rather than bytes to avoid shadowing the built-in type.)

Each tuple in matches contains a start index and a stop index (using slice notation, where the stop index is one position past the final C4 byte). The slices finditer returns are all non-overlapping.
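
For example, against a small in-memory byte string (a standalone sketch; the sample data here is made up):

```python
import re

# Two delimited runs embedded in some filler bytes.
data = b'\x01\xab\xc3XYZ\xab\xc4\x02\xab\xc3QQ\xab\xc4'

# Non-greedy .*? stops at the first AB C4; re.DOTALL lets . match
# any byte value, including newlines.
matches = [(m.start(), m.end())
           for m in re.finditer(b'\xab\xc3.*?\xab\xc4', data, re.DOTALL)]

print(matches)                        # [(1, 8), (9, 15)]
print([data[a:b] for a, b in matches])
```

Slicing data with each (start, stop) pair recovers the full AB C3 ... AB C4 run, delimiters included.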

If you want the indices of all overlapping spans as well (every AB C3 start paired with every later AB C4 stop), you'd need to transform matches along the lines of:

overlapping = [(start, stop)
               for start in map(operator.itemgetter(0), matches)
               for stop in map(operator.itemgetter(1), matches)
               if start < stop]
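
With a hypothetical matches list of two spans, the transform looks like this (a sketch; the input values are made up):

```python
import operator

# Pretend finditer produced these two non-overlapping (start, stop) pairs.
matches = [(1, 8), (9, 15)]

# Pair every start with every later stop, yielding all spans that open
# at some AB C3 and close at some subsequent AB C4.
overlapping = [(start, stop)
               for start in map(operator.itemgetter(0), matches)
               for stop in map(operator.itemgetter(1), matches)
               if start < stop]

print(overlapping)   # [(1, 8), (1, 15), (9, 15)]
```

Note that (1, 15) is the extra span that begins at the first start and ends at the second stop.
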
Matthew Cole

Well, if you cannot afford to read the whole file into memory, you can accomplish this by iterating through the bytes. I've used a deque as an auxiliary data structure, taking advantage of the maxlen parameter to scan every consecutive pair of bytes. To let me use a for-loop instead of an error-prone while-loop, I use the two-argument form of iter, i.e. iter(callable, sentinel), to iterate over the file byte by byte. First, let's build a test case:

>>> import io, functools
>>> import random
>>> some_bytes = bytearray([random.randint(0, 255) for _ in range(12)] + [171, 195] + [88, 42, 88, 42, 88, 42] + [171, 196]+[200, 211, 141])
>>> some_bytes
bytearray(b'\x80\xc4\x8b\x86i\x88\xba\x8a\x8b\x07\x9en\xab\xc3X*X*X*\xab\xc4\xc8\xd3\x8d')
>>>

And now, some preliminaries:

>>> from collections import deque
>>> start = deque([b'\xab', b'\xc3'])
>>> stop = deque([b'\xab', b'\xc4'])
>>> current = deque(maxlen=2)
>>> target = []
>>> inside = False
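
As an aside, the two building blocks used here can each be seen in isolation (a standalone sketch, separate from the walkthrough):

```python
import io
from collections import deque

# A deque with maxlen=2 is a sliding two-element window: appending a
# third element silently evicts the oldest one.
window = deque(maxlen=2)
for byte in (b'\x01', b'\xab', b'\xc3'):
    window.append(byte)
print(window)    # deque([b'\xab', b'\xc3'], maxlen=2)

# Two-argument iter(callable, sentinel) keeps calling the callable
# until it returns the sentinel -- here b'', which f.read(1) returns
# at end-of-file.
f = io.BytesIO(b'abc')
pieces = list(iter(lambda: f.read(1), b''))
print(pieces)    # [b'a', b'b', b'c']
```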

Let's pretend we are reading from a file:

>>> f = io.BytesIO(some_bytes)

Now, create the handy byte-by-byte iterable:

>>> read_byte = functools.partial(f.read, 1)

And now we can loop a lot easier:

>>> for b in iter(read_byte, b''):
...     current.append(b)
...     if not inside and current == start:
...         inside = True
...         continue
...     if inside and current == stop:
...         break
...     if inside:
...         target.append(b)
...
>>> target
[b'X', b'*', b'X', b'*', b'X', b'*', b'\xab']
>>>

You'll notice this leaves the first byte of the "end" sequence in there. It is straightforward to clean up, though. Here is a more fleshed-out example, where there are several "runs" of bytes between the delimiters:

>>> some_bytes = some_bytes * 3
>>> start = deque([b'\xab', b'\xc3'])
>>> stop = deque([b'\xab', b'\xc4'])
>>> current = deque(maxlen=2)
>>> targets = []
>>> target = []
>>> inside = False
>>> f = io.BytesIO(some_bytes)
>>> read_byte = functools.partial(f.read, 1)
>>> for b in iter(read_byte, b''):
...     current.append(b)
...     if not inside and current == start:
...         inside = True
...         continue
...     if inside and current == stop:
...         inside = False
...         target.pop()
...         targets.append(target)
...         target = []
...     if inside:
...         target.append(b)
...
b'\xab'
b'\xab'
b'\xab'
>>> targets
[[b'X', b'*', b'X', b'*', b'X', b'*'], [b'X', b'*', b'X', b'*', b'X', b'*'], [b'X', b'*', b'X', b'*', b'X', b'*']]
>>>

This approach will be slower than reading the file into memory and using re, but it will be memory efficient. There might be some edge-cases to take care of that I haven't thought of, but I think it should be straightforward to extend the above approach. Also, if there is a "start" byte sequence with no corresponding "stop", the target list will keep growing until the file is exhausted.
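
To illustrate that last caveat, here is a standalone sketch with a made-up stream whose AB C3 is never followed by an AB C4; once the loop falls off the end of the file, the still-set inside flag tells you the run never terminated, so you can discard (or log) it rather than trust it:

```python
import io
import functools
from collections import deque

start = deque([b'\xab', b'\xc3'])
current = deque(maxlen=2)
target = []
inside = False

# A start sequence with no corresponding stop: AB C3, then data to EOF.
f = io.BytesIO(b'\x01\xab\xc3XYZ')
for b in iter(functools.partial(f.read, 1), b''):
    current.append(b)
    if not inside and current == start:
        inside = True
        continue
    if inside:
        target.append(b)

# inside is still True here: the run was unterminated, so discard it.
if inside:
    target = []
print(target)    # []
```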

Finally, perhaps the best way is to read the file in manageable chunks, and process those chunks with the same per-byte logic. This combines space and time efficiency. In pseudo-pseudo code:

chunksize = 1024
start = deque([b'\xab', b'\xc3'])
stop = deque([b'\xab', b'\xc4'])
current = deque(maxlen=2)
targets = []
target = []
inside = False
read_chunk = functools.partial(f.read, chunksize)

for bytes_chunk in iter(read_chunk, b''):
    for b in bytes_chunk:
        < same logic as above >
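
Fleshing that pseudocode out into something runnable (a sketch under the same assumptions; note one wrinkle the pseudocode glosses over: in Python 3, iterating over a bytes chunk yields integers, so the inner loop slices out one-byte bytes objects to keep the deque comparisons valid):

```python
import io
import functools
from collections import deque

def find_runs(f, chunksize=1024):
    """Collect the byte runs between each AB C3 ... AB C4 pair,
    reading the binary file object f in chunks of chunksize bytes."""
    start = deque([b'\xab', b'\xc3'])
    stop = deque([b'\xab', b'\xc4'])
    current = deque(maxlen=2)
    targets, target, inside = [], [], False

    for chunk in iter(functools.partial(f.read, chunksize), b''):
        # chunk[i:i+1] is a 1-byte bytes object; chunk[i] would be an
        # int in Python 3 and the deque comparisons would never match.
        for i in range(len(chunk)):
            b = chunk[i:i + 1]
            current.append(b)
            if not inside and current == start:
                inside = True
                continue
            if inside and current == stop:
                inside = False
                target.pop()                  # drop the stray \xab of the stop pair
                targets.append(b''.join(target))
                target = []
            if inside:
                target.append(b)
    return targets

f = io.BytesIO(b'\x01\xab\xc3X*X*\xab\xc4\x02\xab\xc3QQ\xab\xc4')
runs = find_runs(f, chunksize=4)
print(runs)    # [b'X*X*', b'QQ']
```

The state (the two-byte window, the inside flag, and the partial target) persists across chunk boundaries, so a delimiter split between two chunks is still detected.
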
juanpa.arrivillaga