
I'm building a decoder for a very non-compliant binary file, and I need to search the file (specifically, a partial byte buffer, probably 4 kB at a time) for frame headers. This means using an efficient multiple-pattern string-search algorithm; I decided on Aho-Corasick. The problem is that some headers are a partial number of bytes long, i.e., 11 bits. These are the approaches I've come up with:

  1. Modify the Aho-Corasick trie to store children in interval trees rather than hash tables. Given the example pattern 0b11111111111, the first node would store the range [0xFF, 0xFF] and the second node the range [0xE0, 0xFF] (node layout sketched after this list). Unfortunately this greatly expands the number of suffix links and dictionary suffix links among trie nodes, so I think this solution is equivalent to the following one:

  2. Expand the partial-byte pattern into all possible matching whole-byte patterns: 0b11111111111 -> [0xffe0, 0xffe1, ..., 0xfffe, 0xffff] (expansion sketched after this list). Obviously, increasing the number of patterns greatly increases the search time.

  3. Convert the bytes to bits using the bitarray module (the individual bits are still backed by bytes in memory, but are now individually addressable); see the sketch after this list. Unfortunately this increases the search space 8-fold, as well as the search-pattern lengths.

  4. Truncate the pattern to a whole number of bytes (i.e., 0b11111111111 -> 0b11111111) and then manually check the three remaining bits using standard integer bit shifting (sketched after this list). This greatly increases the number of candidate occurrences in the search space that then have to be verified.
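
To make approach 1 concrete, here is a minimal sketch of the node layout only (not a full Aho-Corasick build; all names are my own, and the failure-link construction is omitted):

```python
# Sketch of approach 1: trie children keyed by inclusive byte ranges instead
# of exact bytes, so a pattern's trailing partial byte maps to a range.
from dataclasses import dataclass, field

@dataclass
class RangeNode:
    # (lo, hi, child): inclusive byte range -> child node. A real version
    # might use an interval tree; a small list is enough for a sketch.
    children: list = field(default_factory=list)
    output: bool = False        # True if this node completes a pattern
    fail: "RangeNode" = None    # suffix link, to be built separately

    def goto(self, byte):
        for lo, hi, child in self.children:
            if lo <= byte <= hi:
                return child
        return None

# The 11-bit pattern 0b11111111111:
#   root --[0xFF, 0xFF]--> n1 --[0xE0, 0xFF]--> n2 (output)
root, n1, n2 = RangeNode(), RangeNode(), RangeNode(output=True)
root.children.append((0xFF, 0xFF, n1))
n1.children.append((0xE0, 0xFF, n2))
print(root.goto(0xFF).goto(0xE5).output)   # True: 0xFF 0xE5 matches the header
```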
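
Approach 2's expansion is mechanical. A minimal sketch, assuming the pattern is represented as an integer plus its bit length (my own choice of representation):

```python
# Expand a bit pattern into every byte string whose leading `nbits` bits
# equal `bits`; the trailing don't-care bits of the last byte vary freely.
def expand_pattern(bits, nbits):
    nbytes = (nbits + 7) // 8          # bytes needed to hold the pattern
    free = nbytes * 8 - nbits          # don't-care bits in the last byte
    base = bits << free                # left-align the pattern
    return [(base | tail).to_bytes(nbytes, "big") for tail in range(1 << free)]

# 0b11111111111 (11 bits) -> [b'\xff\xe0', b'\xff\xe1', ..., b'\xff\xff']
patterns = expand_pattern(0b11111111111, 11)
print(patterns[0], patterns[-1], len(patterns))    # b'\xff\xe0' b'\xff\xff' 32
```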
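
Approach 3 with the bitarray package might look roughly like this; I'm assuming bitarray's search() returns the start offset of every occurrence, so check that against the installed version:

```python
# Sketch of approach 3: bit-level search with the bitarray package.
from bitarray import bitarray

buf = b"\x00\xff\xe5\x12\x7f\xf0"      # stand-in for a 4 kB buffer slice

haystack = bitarray()
haystack.frombytes(buf)                # 8x as many elements as bytes

needle = bitarray("11111111111")       # the 11-bit header pattern

hits = list(haystack.search(needle))   # bit offsets, byte-aligned or not
print(hits)                            # [8, 33] for the buffer above
```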
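
And approach 4 reduces to a byte search plus a mask check on the byte that follows. A minimal sketch, assuming headers start on byte boundaries (the helper and its parameters are hypothetical):

```python
# Find the byte-aligned prefix, then verify that the next byte's top
# `tail_bits` bits equal `tail_value`.
def find_truncated(buf, prefix, tail_bits, tail_value):
    mask = ((1 << tail_bits) - 1) << (8 - tail_bits)   # 0b11100000 for 3 bits
    want = tail_value << (8 - tail_bits)
    start = 0
    while (i := buf.find(prefix, start)) != -1:
        j = i + len(prefix)
        if j < len(buf) and (buf[j] & mask) == want:
            yield i
        start = i + 1

# 0b11111111111 -> byte prefix 0xFF plus three remaining bits 0b111
hits = list(find_truncated(b"\x00\xff\xe5\xff\x1f", b"\xff", 3, 0b111))
print(hits)   # [1]: 0xFF at offset 1 is followed by 0xE5, whose top 3 bits are 111
```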

Am I crazy to think all these approaches have exactly the same Big-Theta complexity? ... Or is one approach more efficient?

Aho-Corasick complexity is O(n + m + z), where n is the length of the search text, m is the total length of all search patterns, and z is the number of matches. Aho-Corasick also detects overlapping matches, which isn't necessary here; in fact, overlaps can't occur because my patterns do not overlap one another.

Or should I use a different string-searching algorithm, or some feature of Python byte comparisons I'm not aware of? Maybe a naive O(nm) string search on a 4 kB buffer would in practice be faster than a more efficient algorithm combined with the overhead of handling partial-byte matches? (A rough sketch of such a naive scan is below.)
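
The naive scan I have in mind is something like this (again assuming byte-aligned header starts; purely illustrative):

```python
# Naive scan: read each window as an integer and compare its leading bits
# to the pattern. O(n * m)-ish, but only over a 4 kB buffer.
def naive_bit_search(buf, bits, nbits):
    nbytes = (nbits + 7) // 8
    shift = nbytes * 8 - nbits               # don't-care bits in each window
    for i in range(len(buf) - nbytes + 1):
        window = int.from_bytes(buf[i:i + nbytes], "big")
        if window >> shift == bits:
            yield i

print(list(naive_bit_search(b"\x00\xff\xe5\xff\x1f", 0b11111111111, 11)))   # [1]
```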
user19087
  • Why don't you just implement some of these methods (for instance, naive search and converting bytes to bits, as they look pretty straightforward) and measure their performance? If it's good enough, there's no need to make things complicated. – kraskevich Aug 17 '17 at 18:26
  • That's a good idea, I thought, and you'd think Aho-Corasick would work with any sequence whose elements are comparable (and, if using hash tables rather than trees, hashable). But none of the Python implementations work with bitstrings, list[bool], bitarrays, or anything but strings and sometimes also bytes. – user19087 Aug 17 '17 at 21:54
  • If the data is quite small (4K doesn't sound too much), you can convert a bit array to a string of 0's and 1's. – kraskevich Aug 18 '17 at 07:31
  • There is a book, "Flexible Pattern Matching in Strings" by Navarro and Raffinot, which explains how to extend common algorithms to wildcard string searches. – CoronA Jan 22 '18 at 05:58

0 Answers