Pattern discovery in raw data

Question

I am looking to construct an algorithm for discovering repeating patterns in raw data (non-ASCII).

The shortest and longest pattern sizes should be configurable. The data to search would be tens of thousands of bytes in size.

For example, given the following data:

AB CD 01 AB CD 02 EF 03 02 EF 04 02 EF

The output would be the number of times each repeating pattern occurs. In this case:

ABCD x2
02EF x3

I have looked at several approaches such as suffix trees, but they generally seem to be string-based.

This will be written in Python, but I'm more interested in the concepts involved rather than an actual implementation.

Many thanks for your help.

Read up on compression algorithms. Many are based on just this idea. Your note about "string-based" algorithms makes little sense. There is no reason whatsoever why any "string-based" algorithm wouldn't work with your data as is, or with trivial changes. — n. m. could be an AI, Mar 16 '13 at 21:26
+1 for being "more interested in the concepts involved rather than an actual implementation". — Bernhard Barker, Mar 16 '13 at 21:53

Anders Forsgren · Accepted Answer · 2013-03-17T12:03:44.897

I'd go for an algorithm like Lempel-Ziv-Welch (LZW).

The internal dictionary of the algorithm will hold the pattern strings, and the output (i.e. compressed data) will represent the locations of those substrings. Obtaining the counts from the data is trivial, and the algorithm is fairly easy to implement as well.
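To make the idea concrete, here is a minimal sketch rather than a full compressor. It runs only the dictionary-building half of LZW and treats the learned multi-byte phrases as candidate patterns, then counts each candidate with a second pass over the data. The function names `lzw_phrases` and `repeated_patterns`, the `min_len`/`max_len` parameters, and the second counting pass are my own assumptions for illustration. Note that LZW grows phrases one byte at a time, so this is a heuristic and may miss some repeats.

```python
def lzw_phrases(data):
    """Run the LZW dictionary-building pass; return the multi-byte phrases it learned."""
    # Standard LZW start state: every single byte value is already known.
    dictionary = {bytes([i]) for i in range(256)}
    learned = set()
    w = b""
    for value in data:
        wc = w + bytes([value])
        if wc in dictionary:
            w = wc                      # keep extending the current match
        else:
            dictionary.add(wc)          # learn the new, one-byte-longer phrase
            learned.add(wc)
            w = bytes([value])          # restart matching from this byte
    return learned


def repeated_patterns(data, min_len=2, max_len=8):
    """Count learned phrases (hypothetical helper) that occur at least twice."""
    counts = {}
    for phrase in lzw_phrases(data):
        if min_len <= len(phrase) <= max_len:
            n = data.count(phrase)      # non-overlapping occurrences
            if n >= 2:
                counts[phrase] = n
    return counts


data = bytes.fromhex("AB CD 01 AB CD 02 EF 03 02 EF 04 02 EF")
for phrase, n in sorted(repeated_patterns(data).items()):
    print(phrase.hex().upper(), f"x{n}")
```

On the sample data from the question this reports 02EF three times and ABCD twice. The counting pass costs one scan of the data per candidate, which should be acceptable for inputs in the tens of thousands of bytes.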

Note that "string" in a data compression context does not imply text. Binary data is just using an alphabet of 256 different byte values.
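As a small illustration of that point in Python: a `bytes` object supports the same substring operations as a text string, and iterating over it yields integers in the range 0 to 255. The variable names below are only for illustration.

```python
data = bytes.fromhex("AB CD 01 AB CD 02 EF 03 02 EF 04 02 EF")
pattern = bytes.fromhex("02 EF")

first_three = list(data[:3])        # symbols are plain integers 0..255
occurrences = data.count(pattern)   # non-overlapping substring count
first_index = data.find(pattern)    # offset of the first occurrence

print(first_three, occurrences, first_index)
```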
