Given a string of characters, I want to create a dictionary of all the n-character substrings contained in the string, where the dictionary key is the substring and the value is a list. The first element of the list is the number of occurrences of the substring, and the second element of the list is a list of starting locations for those occurrences.
For example, with n=3
, the string 'abcdabcxdabc'
results in this dictionary:
{'abc': [3, [0, 4, 9]],
'cda': [1, [2]],
'dab': [2, [3, 8]],
'bcd': [1, [1]],
'cxd': [1, [6]],
'bcx': [1, [5]],
'xda': [1, [7]]}
The code below works and is efficient since it only walks through the string once, but I'm wondering if there is a more elegant and/or more pythonic way to do this, maybe using a dictionary comprehension. I'm pretty new to python and still trying to figure out when it makes sense (or is even possible) to use comprehensions, etc.
text = 'abcdabcxdabc'
n = 3
d = {}
for i in range(len(text) - n + 1):
sub = text[i:i + n]
if sub in d:
d[sub][0] += 1
d[sub][1].append(i)
else:
d[sub] = [1, [i]]
print(d)
Update: Thanks for all the replies. They generally confirm my suspicion that this is too complex to be efficiently implemented in a single comprehension (but thanks to volcano for showing that it is possible if efficiency isn't a concern). Thanks to RemcoGerlich and Ignacio Vazquez-Abrams for pointing me to defaultdict. I'll have to dig into that more. My timing results suggest that there's a little more overhead in initializing a defaultdict compared to a dict, but that the running time might be slightly faster, at least for this example. (Timing results posted in a separate comment below.) Now I'm wondering if there are any situations where dict would be preferred over defaultdict. Also, thanks to Narcolei for pointing me to the timeit functions.