Python finding most common pattern in list of strings

Question

I have a large list of API calls stored as strings, which have been stripped of all common syntax ('http://', '.com', '.', etc.).

I would like to return a dictionary of the most common patterns with a length of at least 3, where the keys are the patterns found and the values are the number of occurrences of each pattern. I've tried this:

calls = ['admobapioauthcert', 'admobapinewsession', 'admobendusercampaign']

>>> from itertools import takewhile, izip
>>> ''.join(c[0] for c in takewhile(lambda x: all(x[0] == y for y in x), izip(*calls)))

returns:

'admob'

I would like it to return:

{'obap': 2, 'dmob': 3, 'admo': 3, 'admobap': 2, 'bap': 2, 'dmobap': 2, 'admobapi': 2, 'moba': 2, 'bapi': 2, 'dmo': 3, 'obapi': 2, 'mobapi': 2, 'admob': 3, 'api': 2, 'dmobapi': 2, 'dmoba': 2, 'mobap': 2, 'mob': 3, 'adm': 3, 'admoba': 2, 'oba': 2}

My current method only identifies common prefixes, but I need it to operate on all characters, regardless of their position in the string. Again, I would like to store the number of occurrences of each pattern as the dict values. (I've tried other methods to accomplish this, but they are quite ugly.)

I would just split on a dot, save to a dictionary, and filter for more than one occurrence. You can't really store every possible substring of every url, that would take immense amounts of memory and time. — Tomasz Kaminski, Nov 19 '15 at 03:02
The dots don't always exist in the strings. I'm not so concerned about memory/storage; our AWS setup is pretty beastly. — Benjamin James, Nov 19 '15 at 03:12
It is not possible to get {admob:3,mob:3,api:2} as you say. The question should be edited again and changed to {'obap': 2, 'dmob': 3, 'admo': 3, 'admobap': 2, 'bap': 2, 'dmobap': 2, 'admobapi': 2, 'moba': 2, 'bapi': 2, 'dmo': 3, 'obapi': 2, 'mobapi': 2, 'admob': 3, 'api': 2, 'dmobapi': 2, 'dmoba': 2, 'mobap': 2, 'mob': 3, 'adm': 3, 'admoba': 2, 'oba': 2} — Akshay Hazari, Nov 19 '15 at 06:47

Learner · Accepted Answer · 2015-11-19T06:20:26.490

Use collections.Counter: join the calls, split on the dots, then build the result with a dict comprehension:

>>> from collections import Counter
>>> calls = ['admob.api.oauthcert', 'admob.api.newsession', 'admob.endusercampaign']
>>> l = '.'.join(calls).split(".")
>>> d = Counter(l)
>>> {k: v for k, v in d.most_common(2)}
{'admob': 3, 'api': 2}
>>> {k: v for k, v in d.most_common(4)}
{'admob': 3, 'api': 2, 'newsession': 1, 'oauthcert': 1}

Or

>>> import re
>>> from collections import Counter
>>> d = re.findall(r'\w+', "['admob.api.oauthcert', 'admob.api.newsession', 'admob.endusercampaign']")
>>> {k: v for k, v in Counter(d).most_common(2)}
{'admob': 3, 'api': 2}

Or

>>> from collections import Counter
>>> import re
>>> s = "['admobapioauthcert', 'admobapinewsession', 'admobendusercampaign']"
>>> # Change the alternatives (mob)|(api)|(admob) to the patterns you want.
>>> w = [i for sb in re.findall(r'(?=(mob)|(api)|(admob))', s) for i in sb]
>>> {k: v for k, v in Counter(filter(bool, w)).most_common()}
{'mob': 3, 'admob': 3, 'api': 2}
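
One caveat with the single-lookahead version: at any given position only the first alternative that matches is captured, so two patterns that start at the same index (say 'adm' and 'admob') would not both be counted. A minimal sketch that counts each pattern separately, assuming Python 3; the helper name count_overlapping and the extra 'adm' pattern are only illustrative:

```python
import re
from collections import Counter

def count_overlapping(patterns, strings):
    """Count overlapping occurrences of each literal pattern across all strings."""
    counts = Counter()
    for pattern in patterns:
        # A zero-width lookahead lets matches overlap; re.escape makes the
        # pattern match as literal text rather than as regex syntax.
        regex = re.compile('(?=' + re.escape(pattern) + ')')
        for s in strings:
            counts[pattern] += len(regex.findall(s))
    return counts

calls = ['admobapioauthcert', 'admobapinewsession', 'admobendusercampaign']
print(dict(count_overlapping(['mob', 'api', 'admob', 'adm'], calls)))
```

This still requires knowing the candidate patterns up front, so it does not discover patterns on its own.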

As I mentioned in the comment above, the dots don't always appear in the string. I've edited the question to be more clear about this. I appreciate your help. — Benjamin James, Nov 19 '15 at 05:43

Akshay Hazari · Answer 2 · 2015-11-19T06:50:08.473

Is this what you wanted? It gives the most common substrings after splitting on a dot.

calls = ['admob.api.oauthcert', 'admob.api.newsession', 'admob.endusercampaign']
from collections import Counter
from functools import reduce  # built in on Python 2; this import is required on Python 3
Counter(reduce(lambda x, y: x + y, map(lambda x: x.split("."), calls))).most_common(2)

Output: [('admob', 3), ('api', 2)]

filter(lambda x: x[1]>1 ,Counter(reduce(lambda x,y: x+y,map (lambda x : x.split("."),calls))).most_common())

Update: I don't know if this will work for you:

calls = ['admobapioauthcert', 'admobapinewsession', 'admobendusercamp']
filter(lambda x : x[1]>1 and len(x[0])>2,Counter(reduce(lambda x,y:x + y,reduce(lambda x,y: x+y, map(lambda z :map(lambda x : map(lambda g: z[g:x+1],range(len(z[:x+1]))),range(len(z))),calls)))).most_common())

Output:

[('admo', 3), ('admob', 3), ('adm', 3), ('mob', 3), ('dmob', 3), ('dmo', 3), ('bapi', 2), ('dmobapi', 2), ('dmoba', 2), ('api', 2), ('obapi', 2), ('admobap', 2), ('admoba', 2), ('mobap', 2), ('dmobap', 2), ('bap', 2), ('mobapi', 2), ('moba', 2), ('obap', 2), ('oba', 2), ('admobapi', 2)]
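
The nested one-liner above is hard to read, so here is a rough equivalent spelled out with a generator expression. It is only a sketch, assuming Python 3; the function name common_substrings and its min_len / min_count parameters are made up for illustration:

```python
from collections import Counter

def common_substrings(strings, min_len=3, min_count=2):
    """Map each substring of length >= min_len to its occurrence count,
    keeping only those seen at least min_count times."""
    counts = Counter(
        s[start:end]
        for s in strings
        for start in range(len(s))
        # end runs from start + min_len so shorter substrings are never built
        for end in range(start + min_len, len(s) + 1)
    )
    return {sub: n for sub, n in counts.items() if n >= min_count}

calls = ['admobapioauthcert', 'admobapinewsession', 'admobendusercamp']
patterns = common_substrings(calls)
print(patterns['admob'], patterns['api'], len(patterns))  # 3 2 21
```

Like the one-liner, this counts every occurrence (including repeats inside a single string) and generates a number of substrings that grows quadratically with the length of each string, so it can be slow and memory-hungry on a large dataset.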

edited Nov 19 '15 at 06:50

answered Nov 19 '15 at 04:35

Akshay Hazari


Not quite; as stated above, the dots do not always appear in the calls. Thanks for your efforts. – Benjamin James Nov 19 '15 at 05:41
This gives all the patterns that are at least 3 characters long. On a huge dataset it would probably take a while, though. One more thing: you would want admob as an answer, but this would give you adm, admo, admob, dmo, dmob and mob, so you would need to do more filtering. But this would definitely work. – Akshay Hazari Nov 19 '15 at 06:31
This is perfect, and you're right about filtering. Bumping the minimum string length to 4 should be a good start. Well written, sir. – Benjamin James Nov 19 '15 at 06:58
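
On the extra filtering mentioned in the comments (wanting admob rather than adm, admo, dmob and so on), one possible heuristic is to drop any pattern that is contained in a longer pattern with the same count. This is a sketch assuming Python 3, not part of the original answer; maximal_patterns is a hypothetical helper and the counts dict is a hand-picked subset of the output above:

```python
def maximal_patterns(counts):
    """Drop any pattern contained in a longer pattern with the same count."""
    return {
        sub: n
        for sub, n in counts.items()
        if not any(
            len(other) > len(sub) and sub in other and counts[other] == n
            for other in counts
        )
    }

# Hand-picked subset of the substring counts, for illustration only.
counts = {'adm': 3, 'admo': 3, 'admob': 3, 'dmo': 3, 'dmob': 3, 'mob': 3,
          'api': 2, 'bapi': 2, 'admobapi': 2}
print(maximal_patterns(counts))  # {'admob': 3, 'admobapi': 2}
```

Equal counts do not strictly prove that the shorter pattern only ever appears inside the longer one, so treat this as a heuristic. It also compares every pair of patterns, which may be too slow when the dictionary is large.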
