
I want to store a large number of ngrams on disk in such a way that I can perform the following queries on it:

  • Fetch all ngrams
  • Fetch all ngrams of a certain size
  • Fetch all ngrams which contain all these given elements in any position (subset)
  • Fetch all ngrams of a certain size which have these given elements in these positions (template)

An example of the third point would be all ngrams containing 'a', 'b' and 'c', which yields ngrams like (a,b,c), (b,c,a), (x,a,z,b,c), etc.

An example of the fourth point would be all ngrams matching the template (a, *, *, b), which yields ngrams like (a,x,y,b), (a,a,a,b), etc.

At the moment I'm storing them in a database table with a separate field for each element of the ngram, but this doesn't seem to be the best option for searching ngrams containing given elements in any order and position. In order to search for 3-grams containing "a", "b" and "c" I am using the following SQL 'where' clause:

WHERE
     (ele0 = 'a' OR ele1 = 'a' OR ele2 = 'a') AND
     (ele0 = 'b' OR ele1 = 'b' OR ele2 = 'b') AND
     (ele0 = 'c' OR ele1 = 'c' OR ele2 = 'c')
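
For context, the table behind this clause has one column per ngram position, roughly like the following (the varchar size is arbitrary; shown here for 3-grams only):

CREATE TABLE ngrams3 (
    ele0 VARCHAR(100),  -- first element
    ele1 VARCHAR(100),  -- second element
    ele2 VARCHAR(100)   -- third element
);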

This does not scale up well at all. Is there a better way to structure the data and query it?

mtanti
  • How many is a "large number"? Is there a maximum length of the ngrams? – Gordon Linoff Dec 14 '13 at 14:19
  • I'd like to keep it generic in order to allow for experiments. The number of ngrams can be in the millions and the length can be up to 20 perhaps. – mtanti Dec 14 '13 at 14:21
  • The elements are characters, or (indexes into) words ? – wildplasser Dec 14 '13 at 14:50
  • They are words. The examples use characters just for brevity. – mtanti Dec 14 '13 at 14:55
  • As we speak, I am trying to import http://twitter.com/Hubert_B_Both 's Markov trees into a recursive SQL-table. Since the table is recursive, the depth is virtually unlimited. The input is limited to 5-grams over a dictionary of about 150K words. Total is about 35M rows. – wildplasser Dec 14 '13 at 15:05

1 Answer


You don't specify what a "large number" is. Off-hand, I cannot think of a way to support all the operations you want using standard SQL optimization methods. In some databases, full-text support might help.

If you want to use SQL (which is perfectly reasonable as a persistent store), I would suggest that you simply use strings. In other words, an ngram is a string.
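
A minimal sketch of that table (the column size and index name are assumptions; any string type will do):

create table ngrams (
    ngram varchar(500) not null  -- e.g. 'abc' for the 3-gram (a,b,c)
);

create index idx_ngram on ngrams (ngram);  -- helps anchored patterns like 'a%'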

Your queries would look like:

select *
from ngrams;

select *
from ngrams
where len(ngram) = XXX;  -- len() or length(), depending on the database

select *
from ngrams
where ngram like '%a%' and ngram like '%b%' and ngram like '%c%';

select *
from ngrams
where ngram like 'a__b';

You can then enhance this structure to make it more efficient for certain queries. For instance, if you want to optimize the queries to get length, then add a length column and index it (this will not be very useful unless you have a lot of different lengths). To optimize queries of the third type, add a new column that has the elements in alphabetical order (so, "CBA" would also have a column "ABC"). An index on this would facilitate queries of the third type.
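
As a concrete sketch of those two enhancements (the column and index names are my own; both columns are assumed to be filled in at insert time):

alter table ngrams add ngramLen int;             -- number of elements
alter table ngrams add sortedNgram varchar(500); -- elements in alphabetical order

create index idx_len on ngrams (ngramLen);
create index idx_sorted on ngrams (sortedNgram);

-- With the sorted column, the three LIKEs of the third query collapse into
-- one pattern, because the sought elements must appear in alphabetical order:
select *
from ngrams
where sortedNgram like '%a%b%c%';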

EDIT (in response to comment):

I always thought n-grams referred to individual characters, but Wikipedia says they are ordered sequences of any items.

You can readily handle "words" with the above schema, just by introducing a separator that is not an allowed character in any word, say the '|' delimiter. So, the n-gram "ABC" would be stored as "|A|B|C|":

select *
from ngrams;

select *
from ngrams
where ngramLen = XXX;

select *
from ngrams
where ngram like '%|a|%' and ngram like '%|b|%' and ngram like '%|c|%';

select *
from ngrams
where ngram like '|a|%|b|' and ngramLen = 4;

In this case, you would want a separate field (ngramLen above) holding the number of elements, because you cannot calculate it readily from the string length alone.
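
(If you do need to derive the element count on the fly, one trick, assuming the usual length and replace string functions, is to count the delimiters. This cannot use an index, though, which is why the separate ngramLen field is worth maintaining:)

select *
from ngrams
where length(ngram) - length(replace(ngram, '|', '')) - 1 = 4;
-- '|a|x|y|b|' contains 5 pipes for its 4 elements, hence the -1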

Given that you are thinking of having millions of ngrams, you have a bit of a challenge. With words, this could occupy up to gigabytes of storage, and for performance you will want the table to fit into memory. These operations are well suited to a parallel database, so the process will scale smoothly. One of the advantages of using a database, in fact, is that you can simply throw more memory/disk/processors at the problem, and you will get better performance.

Gordon Linoff
  • Although in my examples I used characters as elements in reality they will be words. – mtanti Dec 14 '13 at 14:34
  • Won't the "like" operator slow down the query significantly? – mtanti Dec 14 '13 at 14:51
  • @mtanti . . . The operations you want are going to require full table scans. `like` is not slow by itself, it just often causes a full table scan. You don't provide enough information in the question to provide more guidance on optimization. – Gordon Linoff Dec 14 '13 at 15:00
  • If you separate the elements into different fields then you can index each one and search without using full table scans. Also, with your solution, how do you look for elements in specific positions? Your example works for finding ngrams which start and end with particular elements but how will it find ngrams of the template (*, a, *, b, *, *)? – mtanti Dec 14 '13 at 15:04
  • By having wildcards at the beginning and end of the string. I note that this discussion has veered pretty far from the original question. – Gordon Linoff Dec 14 '13 at 15:07
  • It has? I'm just asking about the performance and correctness of your solution. I don't see how it veered off. If the elements to search are not at the beginning and end of the ngram, how will the 'like' operator find the ngrams? – mtanti Dec 14 '13 at 15:35