Search for most occurring patterns in a non language text file

Question

I'm not completely sure this question belongs here, but I'm looking to find patterns in an ASCII file.

The file is composed of alphanumeric characters, and I just want to check for repeating patterns in it, disregarding separators and any natural-language words or meaning: just get the most frequently repeated sequences.

I can't seem to find any existing program that does just that (they all seem to work with words, not arbitrary sets of characters). Do you know of any application that can do that?

If there is no such application, how would you recommend I approach coding one?

Unless there's something else not stated here, this is a programming question quite suitable for SO. You do need to state what you mean by "repeating pattern". Would it be a sequence like "aaaa" or "abcabc", or would it be a string that recurs frequently within the text? In the latter case the problem is merely one of identifying and counting patterns; the biggest coding challenge (which is trivial, really) is to adopt an efficient data structure (such as an appropriate hash table). In the former case the challenge is to recognize "patterns". — , Feb 21 '11 at 14:19
It is the latter case, that is, discovering a sequence that recurs frequently... I'd move it to SO, but I'm not sure how... — Jorge Córdoba, Feb 21 '11 at 14:23
@Jorge I voted to close, not because this is a bad question--it's perfectly fine--but because that initiates the process of migration. Good luck! — , Feb 21 '11 at 14:38

score 1 · Answer 1 · answered Feb 21 '11 at 14:45


I'm not aware of any existing program that does this, so I can only recommend coding a solution. You will have to build a slightly modified trie with an occurrence counter on its leaves. Then the task becomes trivial: among all the leaves, find the one with the highest counter; the path from the root to that leaf is the subsequence (pattern) you are searching for.
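A minimal Python sketch of the trie idea follows. It is an illustration under assumptions, not a definitive implementation: the names `TrieNode`, `build_trie` and `most_frequent` are hypothetical, the counter is kept on every node rather than only on the leaves (so each node's count is the number of occurrences of the substring spelled by its path), and the `max_len` and `min_len` bounds are added to keep the trie small and to avoid always returning a single character.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node per distinct substring; `count` is how often it occurs."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0


def build_trie(text: str, max_len: int) -> TrieNode:
    """Insert every substring of length 1..max_len, counting each visit."""
    root = TrieNode()
    for start in range(len(text)):
        node = root
        # Walk at most max_len characters from this start position.
        for ch in text[start:start + max_len]:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root


def most_frequent(root: TrieNode, min_len: int) -> Tuple[str, int]:
    """Return the most frequent substring with at least min_len characters."""
    best: Tuple[str, int] = ("", 0)
    stack: List[Tuple[TrieNode, str]] = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) >= min_len and node.count > best[1]:
            best = (prefix, node.count)
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return best
```

For example, `most_frequent(build_trie("abcabcabd", 3), 2)` returns `("ab", 3)`. Overlapping occurrences are counted, so `"aa"` occurs three times in `"aaaa"`.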

Also FYI: Longest common substring problem

(I know this question is for SO and my answer should be a comment, but I just don't have enough reputation to leave comments.)

answered Feb 21 '11 at 14:45

ffriend


I think the question first needs clarification and refinement. After all, if a string *s* appears in the text *n* times, then each character of *s* is guaranteed to appear at least *n* times also--and is likely to occur more often than that. Thus, the problem as currently stated is solved by counting the occurrences of letters, then of digraphs, then of trigraphs, etc., and will (in any realistic case) terminate after the first step and output the single letter that occurs most frequently. – Feb 21 '11 at 16:16
@whuber: agree, but again: due to my low reputation I couldn't leave a comment, so I tried to give the most common answer. – Feb 21 '11 at 16:51
@whuber that's exactly what I'm looking for, in fact. Let's say you introduce a minimum length and a maximum length, and it gives you the frequency of each string s. – Jorge Córdoba Feb 22 '11 at 08:42
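The bounded version discussed in these comments (count every substring whose length lies between a minimum and a maximum) can be sketched with a hash table, as suggested in the comments on the question. This is a rough Python sketch under assumptions: the function names `substring_frequencies` and `top_repeated` and the parameter `k` are hypothetical, overlapping occurrences are counted, and the input is taken as-is (strip separators beforehand if they should be ignored).

```python
from collections import Counter
from typing import List, Tuple


def substring_frequencies(text: str, min_len: int, max_len: int) -> Counter:
    """Count every substring whose length is between min_len and max_len."""
    counts: Counter = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def top_repeated(text: str, min_len: int, max_len: int,
                 k: int = 10) -> List[Tuple[str, int]]:
    """Return up to k substrings that occur more than once.

    Sorted by count (descending), then length (descending), then
    alphabetically, so the ordering is deterministic.
    """
    counts = substring_frequencies(text, min_len, max_len)
    repeated = [(s, c) for s, c in counts.items() if c > 1]
    repeated.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return repeated[:k]
```

For example, `top_repeated("abcabcabd", 2, 3, k=3)` returns `[("ab", 3), ("abc", 2), ("bca", 2)]`. Memory grows with the number of distinct substrings, so for large files a small `max_len` keeps the table manageable.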

score 1 · Accepted Answer · answered Feb 22 '11 at 09:01


After some searching I finally found Textanz, which analyses the text and gives you a frequency count and a distribution pattern for the most frequently repeated substrings.


answered Feb 22 '11 at 09:01

Jorge Córdoba

