I have an input file consisting of lines with numbers and word sequences, structured like this:
\1-grams:
number w1 number
number w2 number
\2-grams:
number w1 w2 number
number w1 w3 number
number w2 w3 number
\end\
I want to store the word sequences (so-called n-grams) in such a way that I can easily retrieve both numbers for each unique n-gram. What I do now is the following:
import re

tables = {}  # n -> dict mapping each n-gram to its 'number|number' string
ngrams = {}
for line in open(file):
    line = line.strip()
    m = re.search(r'\\([1-9])-grams:', line)  # section header: find n, the nr of words per sequence
    if m is not None:
        ngrams = {}                       # new dict for this n
        tables[int(m.group(1))] = ngrams  # stored by reference, so it fills in place below
    elif line == '\\end\\':               # end marker: nothing left to do
        break
    else:
        m = re.search(r'(-[0-9]+\.?[0-9]+)\t([^\t]+)\t?(-[0-9]+\.[0-9]+)?', line)  # numbers and word sequence
        if m is not None:
            ngrams[m.group(2)] = '{0}|{1}'.format(m.group(1), m.group(3))
This way I can easily and quite quickly find the numbers for e.g. the sequence s = 'w1 w2':
tables[2][s]
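and, since both numbers are stored in one string, I get them back out by splitting on the separator:
num1, num2 = tables[2][s].split('|')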
The problem is that this parsing-and-storing step is rather slow, especially when there are many (>100k) n-grams, and I'm wondering whether there is a faster way to achieve the same result without sacrificing lookup speed. Am I doing something suboptimal here? Where can I improve?
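One direction I considered is dropping the per-line regex and splitting the data lines on tabs instead, along these lines (an untested sketch; load_arpa is just a name I picked for it, and I'm assuming the fields are always tab-separated as shown above):

import re

def load_arpa(path):
    # Sketch: same structure as above, but data lines are parsed with
    # str.split instead of a regex, and the two numbers are kept as floats.
    tables = {}  # n -> {n-gram: (first number, second number or None)}
    ngrams = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            m = re.match(r'\\([1-9])-grams:', line)
            if m:                       # section header: start a new dict for this n
                ngrams = {}
                tables[int(m.group(1))] = ngrams
            elif '\t' in line:          # data line: number <tab> words [<tab> number]
                fields = line.split('\t')
                second = float(fields[2]) if len(fields) > 2 else None
                ngrams[fields[1]] = (float(fields[0]), second)
    return tables

Lookup would stay tables[2][s] and give both numbers as a tuple directly, without the string formatting and splitting. I haven't benchmarked it though, so I don't know whether the regex is really the bottleneck here.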
Thanks in advance,
Joris