I'm building a decoder for a very non-compliant binary format, and I need to search the file (specifically, a partial byte buffer, probably 4kB) for frame headers. This means using an efficient multiple-pattern string-search algorithm, and I settled on Aho-Corasick. The problem is that some headers are a partial number of bytes long, i.e. 11 bits. These are the approaches I've come up with (minimal sketches of each follow the list):
1. Modify the Aho-Corasick trie to store children in interval trees rather than hash tables. Given the example pattern `0b11111111111`, the first node would store the range `[0xff-0xff]` and the second node the range `[0xe0-0xff]`. Unfortunately this greatly expands the number of suffix links and dictionary suffix links among trie nodes, so I think this solution is equivalent to the next one.
2. Expand the partial-byte pattern into all possible matching whole-byte patterns: `0b11111111111` -> `[0xffe0, 0xffe1, ..., 0xfffe, 0xffff]`. Obviously, multiplying the number of patterns like this greatly increases the search time.
3. Convert the bytes to bits using the `bitarray` module (the individual bits are still backed by bytes in memory, but become individually addressable). Unfortunately this expands the search space 8-fold, and the pattern lengths along with it.
4. Truncate each pattern to a whole number of bytes (i.e. `0b11111111111` -> `0b11111111`) and then check the three remaining bits manually with integer bit shifts. Since the truncated pattern is less specific, this greatly increases the number of occurrences that must be verified in the search space.
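To make approach 1 concrete, here is a minimal sketch of the per-node intervals the modified trie would store (`byte_ranges` is a helper name I made up, not part of any Aho-Corasick library):

```python
def byte_ranges(bits: int, nbits: int) -> list[tuple[int, int]]:
    """Per-byte [lo, hi] intervals matching the leading `nbits` of `bits`."""
    nbytes = (nbits + 7) // 8
    pad = nbytes * 8 - nbits            # trailing "don't care" bits
    base = bits << pad                  # left-align the pattern in its bytes
    ranges = []
    for i in range(nbytes):
        shift = 8 * (nbytes - 1 - i)
        lo = (base >> shift) & 0xff
        free = max(0, pad - shift)      # free bits that fall in this byte
        ranges.append((lo, lo | ((1 << free) - 1)))
    return ranges

# byte_ranges(0b11111111111, 11) -> [(0xff, 0xff), (0xe0, 0xff)]
```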
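Approach 2's expansion is easy to write down; this sketch assumes the pattern fits in an int and that matches are byte-aligned (`expand_pattern` is again a made-up name):

```python
def expand_pattern(bits: int, nbits: int) -> list[bytes]:
    """All byte strings whose leading `nbits` bits equal `bits`."""
    nbytes = (nbits + 7) // 8
    pad = nbytes * 8 - nbits            # free trailing bits
    base = bits << pad                  # left-align the pattern
    return [(base | tail).to_bytes(nbytes, "big") for tail in range(1 << pad)]

# expand_pattern(0b11111111111, 11) -> [b'\xff\xe0', b'\xff\xe1', ..., b'\xff\xff']
# (32 patterns: one per combination of the 5 free bits)
```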
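For approach 3, a sketch using the third-party `bitarray` package (assuming its `frombytes` and `itersearch` methods; note that, unlike the byte-level approaches, this finds matches at arbitrary bit offsets):

```python
from bitarray import bitarray

HEADER = bitarray("11111111111")        # the 11-bit pattern

def find_header_bit_offsets(buf: bytes) -> list[int]:
    bits = bitarray()
    bits.frombytes(buf)                 # 8x expansion of the search space
    return list(bits.itersearch(HEADER))

# find_header_bit_offsets(b"\x0f\xfe\x00") -> [4]
```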
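And approach 4, collapsed to a single pattern for illustration (the real version would keep Aho-Corasick over the truncated patterns): search for the byte-aligned prefix `0xff`, then mask-test the three leftover bits of the following byte:

```python
def find_headers(buf: bytes) -> list[int]:
    """Byte offsets where 0b11111111111 starts (byte-aligned only)."""
    out = []
    i = buf.find(b"\xff")               # search the truncated 8-bit prefix
    while i != -1:
        # Verify the 3 leftover bits: next byte must look like 0b111xxxxx.
        if i + 1 < len(buf) and (buf[i + 1] & 0xe0) == 0xe0:
            out.append(i)
        i = buf.find(b"\xff", i + 1)
    return out
```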
Am I crazy to think all these approaches have exactly the same Big-Theta complexity, or is one approach more efficient?
Aho-Corasick's complexity is O(n + m + z), where n is the length of the search text, m is the total length of all search patterns, and z is the number of matches. Aho-Corasick also detects overlapping matches; I don't need that, but it also can't happen, since my patterns do not overlap.
- Or should I use a different string-searching algorithm, or some feature of Python byte comparisons I'm not aware of? Maybe a naive O(nm) string search over a 4kB buffer would be faster in practice than an asymptotically better algorithm plus the overhead of handling the partial bytes?
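A sketch of what I mean by leaning on C-level byte primitives: Python's `re` module works on `bytes`, so the 11-bit header can be encoded as a literal byte plus a byte class (byte-aligned, non-overlapping matches only):

```python
import re

# 0b11111111111 == 0xff followed by a byte whose top 3 bits are set.
HEADER_RE = re.compile(rb"\xff[\xe0-\xff]")

def find_headers_re(buf: bytes) -> list[int]:
    return [m.start() for m in HEADER_RE.finditer(buf)]
```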