A variation of The Greatest Regex Trick Ever: match not just what you want, but as a fall-back, match any four bytes.
def indexes(pattern, bytestr):
    for i, match in enumerate(re.findall(b'(%b)|....' % pattern, bytestr)):
        if match:
            yield i * 4
For your example, re.findall returns [b'', b'\x01\x02\x03\x04', b'']. You just need to report the desired matches then and ignore the undesired ones.
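To see the fall-back at work, here is a small self-contained run on the example data (the same `pattern` and `bytestr` values as in the benchmark code further down):

```python
import re

def indexes(pattern, bytestr):
    # Group 1 is non-empty only where the wanted pattern matched;
    # the '....' fall-back consumes every other four-byte block.
    for i, match in enumerate(re.findall(b'(%b)|....' % pattern, bytestr)):
        if match:
            yield i * 4

pattern = b'\x01\x02\x03\x04'
bytestr = b'\xa1\x02\x03\x04\x01\x02\x03\x04\xb1\x02\x03\x04'

print(re.findall(b'(%b)|....' % pattern, bytestr))
# [b'', b'\x01\x02\x03\x04', b'']
print(list(indexes(pattern, bytestr)))
# [4]
```

The position of each entry in the findall list is the block number, so multiplying it by four gives the byte offset.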
This requires that the pattern always matches exactly (a multiple of) four bytes, since otherwise the matches after it no longer start on four-byte boundaries. If that's not the case, maybe you can make it the case by appending . wildcards to your pattern as needed. Or ... I guess I can put the pattern into a look-ahead ... hold on ...
Ok... here's a version that always matches four bytes, but at each position also tries to match your desired pattern in a positive look-ahead (falling back to matching the empty string, so it's guaranteed to match and not disturb the process).
def indexes(pattern, bytestr):
    for i, match in enumerate(re.findall(b'(?=(%b)|)....' % pattern, bytestr)):
        if match:
            yield i * 4
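A quick comparison of the two versions on a pattern that matches only three bytes, using the alternative test data that is commented out in the benchmark code. The names `fallback_only` and `with_lookahead` are just labels for this sketch:

```python
import re

def fallback_only(pattern, bytestr):
    # First version: a pattern match consumes as many bytes as it matched.
    for i, match in enumerate(re.findall(b'(%b)|....' % pattern, bytestr)):
        if match:
            yield i * 4

def with_lookahead(pattern, bytestr):
    # Second version: the pattern is only tested, '....' always consumes four bytes.
    for i, match in enumerate(re.findall(b'(?=(%b)|)....' % pattern, bytestr)):
        if match:
            yield i * 4

pattern = b'abc'
bytestr = b'1234abcd1abc1234' * 2

print(list(fallback_only(pattern, bytestr)))         # [4]
print(list(fallback_only(pattern + b'.', bytestr)))  # [4, 20]
print(list(with_lookahead(pattern, bytestr)))        # [4, 20]
```

With the bare three-byte pattern, the first version consumes only `abc` at offset 4, so everything after it is scanned from offsets 7, 11, 15, ... and the aligned occurrence at offset 20 is missed. Padding the pattern to four bytes or using the look-ahead keeps the scan on the four-byte grid.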
Hmm... it's faster than your original, but not a lot. And only a bit faster than the trivially optimized version of yours. Here are times with a 1 MB bytestr where the pattern occurs 1000 times:
1000 251 ms  original
1000  81 ms  original_optimized
1000  43 ms  Kelly1
1000  44 ms  Kelly2
1000  54 ms  Kelly3

1000 244 ms  original
1000  90 ms  original_optimized
1000  41 ms  Kelly1
1000  45 ms  Kelly2
1000  45 ms  Kelly3

1000 264 ms  original
1000  87 ms  original_optimized
1000  39 ms  Kelly1
1000  41 ms  Kelly2
1000  42 ms  Kelly3
Kelly1 and Kelly2 are the above. Kelly3 is another idea, where I build a str that prepends each block of four bytes with a "non-byte" character, then uses that to anchor the pattern. That avoids having to filter out false matches like in my other solutions. But it only works if the pattern matches at most four bytes, which the updated question now shows isn't the case. Also, it's not faster, so I didn't fully develop it.
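For illustration, here is one way the anchoring idea could be taken a step further so that it reports offsets instead of the matched text. This is only a sketch under the same restrictions: the pattern is treated as a literal byte sequence of at most four bytes, and the name `indexes_anchored` is made up for this example.

```python
import re

# chr(256) can never come out of a latin-1 decode, which only yields code points 0..255.
SEP = chr(256)

def indexes_anchored(pattern, bytestr):
    s = bytestr.decode('latin1')
    # Put the separator in front of every four-byte block,
    # so each full block occupies five characters.
    marked = ''.join(SEP + s[i:i + 4] for i in range(0, len(s), 4))
    needle = SEP + re.escape(pattern.decode('latin1'))
    for m in re.finditer(needle, marked):
        # Every separator sits at a multiple of 5; map it back to a byte offset.
        yield m.start() // 5 * 4

print(list(indexes_anchored(b'abc', b'1234abcd1abc1234' * 2)))
# [4, 20]
```

Because every hit has to start at a separator, there are no false matches to filter out, at the cost of building the marked-up string first.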
Benchmark code (not cleaned up):
import re

def original(pattern, bytestr):
    for ind in range(0, len(bytestr), 4):
        if re.match(pattern, bytestr[ind : ind + 4]) is not None:
            yield ind

def original_optimized(pattern, bytestr):
    # Compile once and look up the bound method outside the loop.
    match = re.compile(pattern).match
    for ind in range(0, len(bytestr), 4):
        if match(bytestr[ind : ind + 4]):
            yield ind

def Kelly1(pattern, bytestr):
    for i, match in enumerate(re.findall(b'(%b)|....' % pattern, bytestr)):
        if match:
            yield i * 4

def Kelly2(pattern, bytestr):
    for i, match in enumerate(re.findall(b'(?=(%b)|)....' % pattern, bytestr)):
        if match:
            yield i * 4

def Kelly3(pattern, bytestr):
    s = bytestr.decode('latin1')
    # One chr(256) marker slot in front of every four-byte block.
    a = len(bytestr) * 5 // 4 * [chr(256)]
    for i in range(4):
        a[i+1::5] = s[i::4]
    s = ''.join(a)
    # Returns the matches themselves, not their offsets.
    return re.findall(chr(256) + re.escape(pattern.decode('latin1')), s)

funcs = original, original_optimized, Kelly1, Kelly2, Kelly3

pattern = b'\x01\x02\x03\x04'
bytestr = b'\xa1\x02\x03\x04\x01\x02\x03\x04\xb1\x02\x03\x04'
bytestr = (bytestr + bytes(1000)) * 1000

# Alternative test data with a three-byte pattern:
#pattern = b'abc'
#bytestr = b'1234abcd1abc1234' * 2

args = pattern, bytestr

# Debug output, disabled.
if 0:
    print(re.findall(b'(?=(%b)|)....' % pattern, bytestr))
    for match in re.finditer(b'(?=(%b)|)....' % pattern, bytestr):
        print(match)

from time import time

# Three rounds; each line shows the number of results, the time and the function name.
for _ in range(3):
    for f in funcs:
        t = time()
        print(len(list(f(*args))), f'{round((time() - t) * 1e3):3} ms ', f.__name__)
    print()
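A caveat worth checking if the data is arbitrary binary: in a Python bytes pattern, `.` matches any byte except `b'\n'` unless `re.DOTALL` is in effect. A block containing the byte 0x0a therefore makes `....` fail, and the scan slips off the four-byte grid. Below is a minimal sketch of one way around that, assuming Python 3.6+ for the scoped `(?s:...)` flag. The name `indexes_any_byte` and the test data are made up for illustration.

```python
import re

def indexes_any_byte(pattern, bytestr):
    # (?s:....) enables DOTALL for the fall-back only, so a `.` inside
    # the caller's pattern keeps its usual meaning.
    regex = b'(?=(%b)|)(?s:....)' % pattern
    for i, match in enumerate(re.findall(regex, bytestr)):
        if match:
            yield i * 4

pattern = b'\x01\x02\x03\x04'
# The first block contains a newline byte (0x0a).
bytestr = b'\x00\x00\x0a\x00' + pattern + b'\x00' * 4

# Without DOTALL the scan resumes at offset 3 and never tests offset 4.
plain = re.findall(b'(?=(%b)|)....' % pattern, bytestr)
print([i * 4 for i, match in enumerate(plain) if match])
# []
print(list(indexes_any_byte(pattern, bytestr)))
# [4]
```

Passing `re.DOTALL` as the flags argument would also work, but it changes the meaning of any `.` in the supplied pattern as well, which may or may not be wanted.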