I have an input file consisting of lines with numbers and word sequences, structured like this:
\1-grams:
number w1 number
number w2 number
\2-grams:
number w1 w2 number
number w1 w3 number
number w2 w3 number
\end\
I want to store the word sequences (so-called n-grams) in such a way that I can easily retrieve both numbers for each unique n-gram. What I do now is the following:
import re

tables = {}  # n -> dict mapping each n-gram to its 'number|number' string
ngrams = {}
for line in open(file):
    line = line.strip()
    m = re.search(r'\\([1-9])-grams:', line)  # section header: find n, the nr of words per sequence
    if m is not None:
        ngrams = {}                       # new dict for this n
        tables[int(m.group(1))] = ngrams  # stored by reference, so it fills in place below
    elif line == '\\end\\':               # end marker: nothing left to do
        break
    else:
        m = re.search(r'(-[0-9]+\.?[0-9]+)\t([^\t]+)\t?(-[0-9]+\.[0-9]+)?', line)  # numbers and word sequence
        if m is not None:
            ngrams[m.group(2)] = '{0}|{1}'.format(m.group(1), m.group(3))
This way I can easily and quite quickly find the numbers for e.g. the sequence s = 'w1 w2':
tables[2][s]
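and, since both numbers are stored in one string, I get them back out by splitting on the separator:
num1, num2 = tables[2][s].split('|')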
The problem is that this parsing-and-storing step is rather slow, especially when there are many (>100k) n-grams, and I'm wondering whether there is a faster way to achieve the same result without sacrificing lookup speed. Am I doing something suboptimal here? Where can I improve?
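One direction I considered is dropping the per-line regex and splitting the data lines on tabs instead, along these lines (an untested sketch; load_arpa is just a name I picked for it, and I'm assuming the fields are always tab-separated as shown above):

import re

def load_arpa(path):
    # Sketch: same structure as above, but data lines are parsed with
    # str.split instead of a regex, and the two numbers are kept as floats.
    tables = {}  # n -> {n-gram: (first number, second number or None)}
    ngrams = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            m = re.match(r'\\([1-9])-grams:', line)
            if m:                       # section header: start a new dict for this n
                ngrams = {}
                tables[int(m.group(1))] = ngrams
            elif '\t' in line:          # data line: number <tab> words [<tab> number]
                fields = line.split('\t')
                second = float(fields[2]) if len(fields) > 2 else None
                ngrams[fields[1]] = (float(fields[0]), second)
    return tables

Lookup would stay tables[2][s] and give both numbers as a tuple directly, without the string formatting and splitting. I haven't benchmarked it though, so I don't know whether the regex is really the bottleneck here.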
Thanks in advance,
Joris