
I have a long list of sub-strings (close to 16000), and I want to find where the repeating cycle starts and stops. I have come up with this code as a starting point:

strings= ['1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011',
        '1001000000110110',
        '0010000001101101',
        '1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011',]

pat = [ '1100100100000010',
        '1001001000000110',
        '0010010000001100',]

for i in range(0,len(strings)-1):
    for j in range(0,len(pat)):
        if strings[i] == pat[j]:
            continue
        if strings[i+1] == pat[j]:
            print 'match', strings[i]
            break
        break

The problem with this method is that you have to know what pat is in order to search for it. I would like to start with the first n sub-strings (in this case 3) and search for them; if there is no match, move down one sub-string to the next 3, and so on until it has gone through the entire list or found the repeat. I believe that if n is high enough (maybe 10), it will find the repeat without being too time-consuming.

paperstsoap

3 Answers

1
strings= ['1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011',
        '1001000000110110',
        '0010000001101101',
        '1100100100000010',
        '1001001000000110',
        '0010010000001100',
        '0100100000011011',]

n = 3

patt_dict = {}

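# Count every window of n consecutive strings; the "+ 1" includes the final window.
for i in range(len(strings) - n + 1):
    patt = ' '.join(strings[i:i + n])
    # Test membership on the dict itself (a hash lookup), not on .keys().
    if patt not in patt_dict:
        patt_dict[patt] = 1
    else:
        patt_dict[patt] += 1

# Report each window that occurs more than once.
for key in patt_dict:
    if patt_dict[key] > 1:
        print('Found ' + str(patt_dict[key]) + ' repeating instances of ' + str(key) + '.')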

Give this a shot. For a fixed n it runs in time linear in the number of strings. It uses a dictionary to count how many times each window of n consecutive strings occurs in the list. If a count exceeds 1, then we have a repeating pattern :)
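To see where each repeat starts, and not only how often it occurs, one possible variation stores the start index of every window instead of a count. This is only a sketch and not part of the answer above: the helper name `window_positions` is made up for illustration, and it assumes a window is identified by the tuple of its n strings.

```python
from collections import defaultdict

def window_positions(strings, n):
    """Map each repeated window of n consecutive strings to its start indices."""
    positions = defaultdict(list)
    for i in range(len(strings) - n + 1):
        positions[tuple(strings[i:i + n])].append(i)
    # Keep only the windows that occur more than once.
    return {w: idx for w, idx in positions.items() if len(idx) > 1}

strings = ['1100100100000010', '1001001000000110', '0010010000001100',
           '0100100000011011', '1001000000110110', '0010000001101101',
           '1100100100000010', '1001001000000110', '0010010000001100',
           '0100100000011011']

repeats = window_positions(strings, 3)
for window, starts in sorted(repeats.items(), key=lambda kv: kv[1]):
    print(starts, window[0])  # start indices and the first string of the window
```

For the sample list the repeated windows start at indices 0 and 6, and at 1 and 7. The gap between the start indices (6 here) is a candidate for the cycle length.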

LogCapy
  • So if I understand this correctly, if I increase n it will give me the largest repeating series of sequences that can be found? – paperstsoap Jul 19 '16 at 19:22
  • This particular function will give you the total number of times that the n-length sequence appears in "strings." patt_dict keeps track of that for you. – LogCapy Jul 19 '16 at 19:24
  • Awesome, I'm playing around with it now. – paperstsoap Jul 19 '16 at 19:25
  • This does what I'm looking for. After putting it in a for loop to get all the combinations, I can then mess around with it to get what I need. Thank you – paperstsoap Jul 19 '16 at 20:55
0

Here's something that will find every run in the strings list that matches a contiguous part of pat.

strings = ['A', 'B', 'C', 'D', 'Z', 'B', 'B', 'C', 'A', 'B', 'C']

pat = ['A', 'B', 'C', 'D']

i = 0
while i < len(strings):
    if strings[i] not in pat:
        i += 1
        continue
    # Position in pat where the current element first occurs.
    start = pat.index(strings[i])
    matches = 0
    # Walk strings and pat in step until they diverge or either one runs out.
    for j in range(start, len(pat)):
        k = i + j - start
        if k >= len(strings) or strings[k] != pat[j]:
            break
        matches += 1
    # matches is at least 1 here, because strings[i] == pat[start].
    print('matched at index %d subsequence length: %d value %s' % (i, matches, strings[i]))
    i += matches

Output:

matched at index 0 subsequence length: 4 value A
matched at index 5 subsequence length: 1 value B
matched at index 6 subsequence length: 2 value B
matched at index 8 subsequence length: 3 value A
Alexander
0

Here's a reasonably simple way that finds all matches of all lengths >= 1:

def findall(xs):
    from itertools import combinations
    # x2i maps each member of xs to a list of all the
    # indices at which that member appears.
    x2i = {}
    for i, x in enumerate(xs):
        x2i.setdefault(x, []).append(i)
    n = len(xs)
    for ixs in x2i.values():
        if len(ixs) > 1:
            for i, j in combinations(ixs, 2):
                length = 1 # xs[i] == xs[j]
                while (i + length < n and
                       j + length < n and
                       xs[i + length] == xs[j + length]):
                    length += 1
                yield i, j, length

Then:

for i, j, n in findall(strings):
    print("match of length", n, "at indices", i, "and", j)

displays:

match of length 4 at indices 0 and 6
match of length 1 at indices 3 and 9
match of length 3 at indices 1 and 7
match of length 2 at indices 2 and 8

What you do and don't want hasn't been precisely specified, so this lists all matches. You probably don't really want some of them. For example, the match of length 3 at indices 1 and 7 is just the tail end of the match of length 4 at indices 0 and 6.

So you'll need to alter the code to compute what you really want. Perhaps you only want a single, maximal match? All maximal matches? Only matches of a particular length? Etc.
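As one illustration of such an alteration, the sketch below keeps only a single longest match. It is an assumption about what might be wanted, not a definitive answer: the name `longest_repeat` is hypothetical, the two occurrences are allowed to overlap, and when several matches tie for longest it simply returns the first one it encounters.

```python
from itertools import combinations

def longest_repeat(xs):
    """Return (i, j, length) for a longest run found at two indices i < j, or None."""
    # Group the indices of xs by value, as in findall.
    x2i = {}
    for i, x in enumerate(xs):
        x2i.setdefault(x, []).append(i)
    n = len(xs)
    best = None
    for ixs in x2i.values():
        for i, j in combinations(ixs, 2):
            length = 1  # xs[i] == xs[j]
            # i < j, so checking j + length against n also bounds i + length.
            while j + length < n and xs[i + length] == xs[j + length]:
                length += 1
            if best is None or length > best[2]:
                best = (i, j, length)
    return best

strings = ['1100100100000010', '1001001000000110', '0010010000001100',
           '0100100000011011', '1001000000110110', '0010000001101101',
           '1100100100000010', '1001001000000110', '0010010000001100',
           '0100100000011011']

print(longest_repeat(strings))  # (0, 6, 4)
```

It returns None when no element of the list occurs twice.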

Tim Peters