-1

When reading data from an area of memory, the result is a bytes string. Usually the data we care about is composed of 4 bytes (an int or float value), and the address of the data is divisible by 4. Since the address of the data can be found with a regex pattern in Python (using functions like re.findall or re.finditer), is there a way to do this faster by searching for the pattern only at every 4th byte?

I know this can be done in software like CheatEngine, but I haven't found a way to do it in Python directly. Using a 4-step loop with re.match is rather slow.

Example: find the occurrences of pattern = b'\x01\x02\x03\x04' (in general it's a non-string pattern) in
bytestr = b'\xa1\x02\x03\x04\x01\x02\x03\x04\xb1\x02\x03\x04',
here it appears at index 4. Instead of using re.findall(pattern, bytestr), since we know the pattern can only appear at indices 0, 4, 8, ..., we want to accelerate the process with something like

for ind in range(0, len(bytestr), 4):
    if re.match(pattern, bytestr[ind : ind + 4]) is not None:
        print(ind)

But it turns out to be rather slow when bytestr is quite big, even slower than the re.findall(pattern, bytestr). Is there a direct way to achieve this improvement with Python?

============================

The real pattern I used in my program is something like

pattern = br"\x93\x5F\x01\x00\x01\x00\x00\x00.....[\x00-\xea]\x00\x00\x60\xea\x00\x00"
user498029
  • 115
  • 3
  • `^(....)*abcd`? – Sam Mason Mar 05 '23 at 17:53
  • @SamMason I just tried, it's significantly slower than `re.findall`. – user498029 Mar 05 '23 at 18:08
  • How big is "quite big"? I have two ideas, would like to do benchmarking. – Kelly Bundy Mar 05 '23 at 18:17
  • And how often does the pattern typically occur? – Kelly Bundy Mar 05 '23 at 18:18
  • @KellyBundy I intend to scan the whole memory of a game process to find the data I want. The memory pages are usually several millions bytes. – user498029 Mar 05 '23 at 18:23
  • @KellyBundy The pattern is about the HP or coordinates of the champions, so it appears not very often. – user498029 Mar 05 '23 at 18:25
  • Is the pattern guaranteed to match four bytes (if it does match)? Or can it be *made* to always match four bytes (by appending the necessary number of `.`s)? – Kelly Bundy Mar 05 '23 at 18:32
  • @KellyBundy The 4 bytes pattern in the example is oversimplified. The pattern I used is longer with 4n bytes pure string prefix (n = 2, 3, ...) and then some metacharacters. – user498029 Mar 05 '23 at 18:40
  • 1
    Can you tell how long your solution and my solutions and the pure `re.findall(pattern, bytestr)` take in real cases? With that "pure string prefix" you just mentioned, which I'd say is important and belongs into the question, I suspect the pure `findall` will be fastest. – Kelly Bundy Mar 05 '23 at 19:10
  • I imagine that both `pattern` and `bytestr` contain a bunch of zeros or similarly common bytes, and we have significantly more than 4 bytes of pattern. In which case, use [Boyer-Moore](https://pypi.org/project/pybmoore) to leapfrog over stretches of bytes: `matches = pybmoore.search(pattern, bytestr)`. If the pattern is _rather_ long, you might be able to skip over whole cache lines. – J_H Mar 05 '23 at 19:20
  • Ok, with that pattern you just added, I conclude that I indeed just wasted an hour on a futile task. – Kelly Bundy Mar 05 '23 at 19:23
  • @KellyBundy Sorry for not explaining my ideas clearly enough. Thanks anyway. – user498029 Mar 05 '23 at 19:26
  • 2
    Unclear why you said "even slower than the re.findall(pattern, bytestr)". I just tested that, speed was more than a gigabyte per second. Why did you call that slow? – Kelly Bundy Mar 05 '23 at 19:34
  • 2
    @J_H a good regular expression engine will use tricks like Boyer-Moore or better to be optimally efficient. GNU grep is infamously fast for example. I don't know about Python's engine specifically, but I'd expect it to be top notch and I've never heard any complaints. – Mark Ransom Mar 06 '23 at 03:26
  • @MarkRansom Yes, that's why I wouldn't even have tried anything had I known about that prefix. From an (old) [answer](https://stackoverflow.com/a/12815771/12671057): *"optimisation to quickly match patterns prefixed with a string lateral"*. I think I also remember it does a special fast search for a fixed first character. – Kelly Bundy Mar 06 '23 at 10:26
  • [Demo](https://ato.pxeger.com/run?1=PY9LCsIwFEXnWcWbJSm1H5AiBSfOXYE4SGyCgeZD-gqKuBInnehS3IO7MVJ1dODCvZdze4QzHr2bpvuIerF6PY0NPiJERXT0FtBYZRD-aVACCRlgDZIKChnUVZY1RPsIAYz7pCeaJ8gZxYydPOxPtCUAmKrWODZPsV5Y2Yk2LRfauE70PQs5DDwHN1qp4rquOIcy3aRuiMYh0_QS_eg61ivHBl5iWauGt8srbDflAOkv8NnmK_WTewM) with a million `a` as text, pattern `bx` runs through with 2400 MB/s while pattern `.x` does 80 MB/s. – Kelly Bundy Mar 06 '23 at 10:26

2 Answers

1

A variation of The Greatest Regex Trick Ever: match not just what you want, but as a fall-back, match any four bytes.

def indexes(pattern, bytestr):
    for i, match in enumerate(re.findall(b'(%b)|....' % pattern, bytestr)):
        if match:
            yield i * 4

For your example, re.findall returns [b'', b'\x01\x02\x03\x04', b'']. You then just need to report the desired matches and ignore the empty ones.
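As a self-contained sanity check on the question's example data:

```python
import re

def indexes(pattern, bytestr):
    # Match the wanted pattern, or fall back to matching any four bytes,
    # so findall always advances in aligned 4-byte steps.
    for i, match in enumerate(re.findall(b'(%b)|....' % pattern, bytestr)):
        if match:
            yield i * 4

pattern = b'\x01\x02\x03\x04'
bytestr = b'\xa1\x02\x03\x04\x01\x02\x03\x04\xb1\x02\x03\x04'
print(list(indexes(pattern, bytestr)))  # -> [4]
```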

This requires that the pattern always matches exactly (a multiple of) four bytes. If that's not the case, maybe you can make it the case by appending . to your pattern as needed. Or ... I guess I can put the pattern into a look-ahead ... hold on ...

Ok... here's a version that always matches four bytes, but at each position also tries to match your desired pattern in a positive look-ahead (falling back to matching the empty string, so it's guaranteed to match and not disturb the process).

def indexes(pattern, bytestr):
    for i, match in enumerate(re.findall(b'(?=(%b)|)....' % pattern, bytestr)):
        if match:
            yield i * 4

Hmm... it's faster than your original, but not a lot. And only a bit faster than the trivially optimized version of yours. Here are times with a 1 MB bytestr where the pattern occurs 1000 times:

1000 251 ms  original
1000  81 ms  original_optimized
1000  43 ms  Kelly1
1000  44 ms  Kelly2
1000  54 ms  Kelly3

1000 244 ms  original
1000  90 ms  original_optimized
1000  41 ms  Kelly1
1000  45 ms  Kelly2
1000  45 ms  Kelly3

1000 264 ms  original
1000  87 ms  original_optimized
1000  39 ms  Kelly1
1000  41 ms  Kelly2
1000  42 ms  Kelly3

Kelly1 and Kelly2 are the above. Kelly3 is another idea, where I build a str that prepends each block of four bytes with a "non-byte" character, then uses that to anchor the pattern. That avoids having to filter out false matches like in my other solutions. But it only works if the pattern matches at most four bytes, which the updated question now shows isn't the case. Also, it's not faster, so I didn't fully develop it.

Benchmark code (not cleaned up):

import re

def original(pattern, bytestr):
    for ind in range(0, len(bytestr), 4):
        if re.match(pattern, bytestr[ind : ind + 4]) is not None:
            yield ind

def original_optimized(pattern, bytestr):
    match = re.compile(pattern).match
    for ind in range(0, len(bytestr), 4):
        if match(bytestr[ind : ind + 4]):
            yield ind

def Kelly1(pattern, bytestr):
    for i, match in enumerate(re.findall(b'(%b)|....' % pattern, bytestr)):
        if match:
            yield i * 4

def Kelly2(pattern, bytestr):
    for i, match in enumerate(re.findall(b'(?=(%b)|)....' % pattern, bytestr)):
        if match:
            yield i * 4

def Kelly3(pattern, bytestr):
    s = bytestr.decode('latin1')
    a = len(bytestr) * 5 // 4 * [chr(256)]
    for i in range(4):
        a[i+1::5] = s[i::4]
    s = ''.join(a)
    return re.findall(chr(256) + re.escape(pattern.decode('latin1')), s)

funcs = original, original_optimized, Kelly1, Kelly2, Kelly3

pattern = b'\x01\x02\x03\x04'
bytestr = b'\xa1\x02\x03\x04\x01\x02\x03\x04\xb1\x02\x03\x04'

bytestr = (bytestr + bytes(1000)) * 1000

#pattern = b'abc'
#bytestr = b'1234abcd1abc1234' * 2
args = pattern, bytestr

if 0:
  print(re.findall(b'(?=(%b)|)....' % pattern, bytestr))
  for match in re.finditer(b'(?=(%b)|)....' % pattern, bytestr):
    print(match)

from time import time
for _ in range(3):
  for f in funcs:
    t = time()
    print(len(list(f(*args))), f'{round((time() - t) * 1e3):3} ms ', f.__name__)
  print()

Attempt This Online!

Kelly Bundy
  • 23,480
  • 7
  • 29
  • 65
  • From your code, is `Kelly1` and `Kelly2` destined not to be faster than `re.findall(pattern, bytestr)`? – user498029 Mar 05 '23 at 19:08
  • @user498029 Depends on the pattern. I'm confident I can show patterns and bytestrs where mine are faster, but with the prefix you just mentioned under the question, I suspect `re.findall(pattern, bytestr)` might be the fastest way. – Kelly Bundy Mar 05 '23 at 19:14
0

We can use this algorithm to improve the performance of the search.

for i in range(0, len(bytestr), 4):
    if pattern == bytestr[i:i+4]:
        print("Location of pattern: %d" %i)

This accomplishes a couple things.

First of all, it removes the overhead generated by the re.match function call. (We save CPU cycles and memory by using the == operator.) [1]

Second of all, it is more efficient than re.findall because we are limiting our search to locations in the string that are a multiple of 4 from the origin.

(The re.findall function checks every index in the string for a match, whereas the code above only checks indices that are a multiple of 4.)
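If the pattern really is a literal byte string (which the `==` comparison requires anyway), a sketch of a related option is to let the C-level `bytes.find` do the scanning and only keep the hits that fall on 4-byte boundaries:

```python
def aligned_find(pattern, bytestr):
    """Yield offsets of a literal pattern that fall on 4-byte boundaries.

    bytes.find scans at C speed, so the Python-level loop only runs
    once per raw hit rather than once per 4-byte slot.
    """
    i = bytestr.find(pattern)
    while i != -1:
        if i % 4 == 0:
            yield i
        i = bytestr.find(pattern, i + 1)

print(list(aligned_find(b'\x01\x02\x03\x04',
                        b'\xa1\x02\x03\x04\x01\x02\x03\x04\xb1\x02\x03\x04')))
# -> [4]
```

Note this only works for literal patterns; it does not handle the metacharacters in the OP's real pattern.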

Another way to make it more efficient is to write the code in C/C++.


Footnotes:

[1] If we have a very long byte string and we call re.match at every multiple-of-4 index, then we end up calling re.match many times, and this generates a lot of overhead.

ktm5124
  • 11,861
  • 21
  • 74
  • 119
  • 2
    I agree that calling `re.match` N times is awful, but the OP's goal is to have a regex they can call only _once_, making the overhead moot. A good NFA-based regex engine is surprisingly fast, though of course the question here is whether the language exposed by Python is sufficiently expressive (and, as a further complication, Python's `re` module historically _isn't_ NFA-based but was for a long time a PCRE-inspired backtracking abomination; not sure if that's still true today, but back when it certainly was true, `re2` and other alternative regex implementations for Python did exist). – Charles Duffy Mar 05 '23 at 18:28
  • @CharlesDuffy More importantly, they wrote *"in general it's a non-string pattern"*, so this answer is just wrong. – Kelly Bundy Mar 05 '23 at 19:19