Counting the most frequent sequence occurring in a string

Question

Full Code is here: https://nekobin.com/yehukomiya (Don't mind the computeFile() function) Hi, I've got a text file with repeating chars in binary string like "0101001001111000010010011110010000000" I already got the KMPSearch() function that I use to count how many times a given sequence occurs in my string. I also count in the ex1() function the char that occurs most in the string, but what I want to achieve is find the most frequent sequence and how many times it occurs, and return the n sequences that repeat the most. e.g. s = "01010010010001000111101100001010011001111000010010011110010000000" n = 20 I want to return the following list:

[     
      (4, ['0001', '0011', '1100' ]),
      (5, ['011', '1000', '110' ]),
      (6, ['0000', '111']),
      (7, ['0010','1001' ]),
      (8, ['0100']),
      (10,['010']),
      (11,['000', '001', '11']),
      (12,['100']),
      (15,['01','10']),
      (23,['00'])
]

Where the firs column is the m-times the sequences occurs, the following column are the actual sequences that repeat themselves m times, and the number of entry in the list is n. It returns all the occurrences because n is less than the number of the occurrences it found. but if n was for example equal to 4 I wanted the list:

[
      (11,['000', '001', '11']),
      (12,['100']),
      (15,['01','10']),
      (23,['00'])
]

At the end I would also like to sort the final tuple as shown. Thank you all for your time and patience.

Are you limited to python standard library or are you allowed to use external packages? — Daweo, Nov 04 '22 at 11:51

QuadU · Accepted Answer · 2022-11-04T15:41:04.000

A rather straightforward, unoptimized way to approach this is in 2 steps could be (explanation below):

from collections import defaultdict

s = "01010010010001000111101100001010011001111000010010011110010000000"

min_seq_length = 2

# 1. Count recurring sequences
sequence_dict = defaultdict(lambda: 0)
for seq_start in range(len(s)):
    for seq_end in range(seq_start + min_seq_length, len(s)+1):
        sequence = s[seq_start:seq_end]
        sequence_dict[sequence] += 1

# 2. Group sequences by number of occurrences
grouped_sequences_dict = defaultdict(list)
for sequence, count in sequence_dict.items():
    if count > 1:
        grouped_sequences_dict[count].append(sequence)

# 3. Sort and filter grouped sequences
sorted_sequences = sorted(list(grouped_sequences_dict.items()))

top_n_results = 4
top_sequences = sorted_sequences[-top_n_results:]

for seq in top_sequences:
    print(seq)

output

(11, ['001', '000', '11'])
(12, ['100'])     
(15, ['01', '10'])
(23, ['00'])

explanation:

Count recurring sequences by iterating over all substrings of at least 2 characters and building a dictionary to keep track how many times every unique sequences occurs:
- sequence_dict's key is a unique sequence (e.g. '00')
- sequence_dict's value is the amount of occurrences
Group sequences by number of occurrences by iterating over all above sequences and building a dictionary that keeps a list of all sequences, per amount of occurrences:
- grouped_sequences_dict's key is the amount of occurrences
- grouped_sequences_dict's value is a list of sequences that occur this amount of times
Sort and filter grouped sequences
- convert above dictionary to a list of (key,value) pairs
- sort it (default sorting of a tuple will be done on the first item of the tuple, which is the key of above dictionary, being the amount of times a sequence occurs
- take the last n (4) items

Could probably be optimized/simplified with a pandas.

[edit] your code

My apologies, I just realized you provided a whole bunch of code already. It looks like you are already covering some aspects. Maybe an additional general remark: try to write the code as pythonic as possible, e.g.:

avoid using meaningless variable names (M,N,i,j etc.)
avoid using while loops and increments when it is possible to use for loops on ranges, or even better for loops on iterables (or even better, list comprehension if applicable)

This will make your code much more readable for yourself and for others

Thanks a lot, but I forgot to tell you, that I can't use libs, only standard python — MattMlgn, Nov 07 '22 at 09:10
The `defaultdict` I'm using can easily be replaced by a regular dictionary, e.g. `sequence_dict = {}`. When counting values, you should check whether or not the key already exists, e.g.: `sequence_dict[sequence] = (sequence_dict[sequence] + 1) if sequence in sequence_dict else 1` — QuadU, Nov 11 '22 at 16:53

Counting the most frequent sequence occurring in a string

1 Answers1

[edit] your code