I want to store a large number of ngrams on disk in such a way that I can run the following queries against them:
- Fetch all ngrams
- Fetch all ngrams of a certain size
- Fetch all ngrams which contain all these given elements in any position (subset)
- Fetch all ngrams of a certain size which have these given elements in these positions (template)
An example of the third query (subset) would be all ngrams containing 'a', 'b' and 'c', which should match ngrams like (a,b,c), (b,c,a), (x,a,z,b,c), etc.
An example of the fourth query (template) would be all ngrams following the template (a, *, *, b), which should match ngrams like (a,x,y,b), (a,a,a,b), etc.
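To pin down the intended semantics of the subset and template queries, here is a minimal in-memory sketch in Python. It is not a storage design, only a reference for what a correct result looks like. It assumes ngrams are tuples of strings, that `None` stands for the `*` wildcard, and that repeated search elements are treated as a set; the function names `matches_subset` and `matches_template` are hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple

Ngram = Tuple[str, ...]


def matches_subset(ngram: Ngram, elements: Iterable[str]) -> bool:
    """True if every given element occurs somewhere in the ngram (any position)."""
    # Assumption: duplicates in `elements` are ignored (set semantics).
    return set(elements) <= set(ngram)


def matches_template(ngram: Ngram, template: Sequence[Optional[str]]) -> bool:
    """True if the ngram has the template's size and agrees at every fixed slot."""
    if len(ngram) != len(template):
        return False
    # None plays the role of '*' and matches any element.
    return all(slot is None or slot == element
               for element, slot in zip(ngram, template))


ngrams = [
    ("a", "b", "c"),
    ("b", "c", "a"),
    ("x", "a", "z", "b", "c"),
    ("a", "x", "y", "b"),
    ("a", "a", "a", "b"),
]

# Subset query: contains 'a', 'b' and 'c' in any position.
subset_hits = [g for g in ngrams if matches_subset(g, ["a", "b", "c"])]

# Template query: (a, *, *, b)
template_hits = [g for g in ngrams if matches_template(g, ["a", None, None, "b"])]
```

With this sample data, `subset_hits` holds the first three ngrams and `template_hits` holds the last two. Any on-disk structure would need to return the same results without scanning every ngram.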
At the moment I'm storing them in a database table with a separate field for each element of the ngram, but this doesn't seem well suited to searching for ngrams that contain given elements in any order and position. To search for 3-grams containing 'a', 'b' and 'c', I am using the following SQL WHERE clause:
WHERE
    (ele0 = 'a' OR ele1 = 'a' OR ele2 = 'a') AND
    (ele0 = 'b' OR ele1 = 'b' OR ele2 = 'b') AND
    (ele0 = 'c' OR ele1 = 'c' OR ele2 = 'c')
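This does not scale well at all: the clause needs one comparison per search element per position, so it grows quickly with longer ngrams and more search elements. Is there a better way to structure the data and query it?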
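For anyone who wants to reproduce the current approach, the following sketch generates the same WHERE clause for any ngram size and runs it against an in-memory SQLite database. The table name `ngrams`, the helper `subset_where`, and the use of SQLite are assumptions made for illustration; only the `ele0`, `ele1`, ... column naming comes from the query above. It also makes the growth visible: k search elements against n-grams of size n produce n × k comparisons.

```python
import sqlite3
from typing import List, Sequence, Tuple


def subset_where(elements: Sequence[str], n: int) -> Tuple[str, List[str]]:
    """Build the column-per-position WHERE clause for n-grams of size n.

    Returns the SQL text (with '?' placeholders) and the parameter list.
    Assumes `elements` is non-empty; an empty list would yield an empty clause.
    """
    groups: List[str] = []
    params: List[str] = []
    for element in elements:
        # One OR-group per search element, one comparison per position.
        groups.append("(" + " OR ".join(f"ele{i} = ?" for i in range(n)) + ")")
        params.extend([element] * n)
    return " AND ".join(groups), params


conn = sqlite3.connect(":memory:")
# Hypothetical table for 3-grams only, mirroring the column layout in the question.
conn.execute("CREATE TABLE ngrams (ele0 TEXT, ele1 TEXT, ele2 TEXT)")
conn.executemany(
    "INSERT INTO ngrams VALUES (?, ?, ?)",
    [("a", "b", "c"), ("b", "c", "a"), ("a", "b", "x")],
)

where, params = subset_where(["a", "b", "c"], 3)
rows = conn.execute(
    f"SELECT ele0, ele1, ele2 FROM ngrams WHERE {where} ORDER BY ele0",
    params,
).fetchall()
comparisons = where.count("?")  # n * k placeholders in the generated clause
```

Here `rows` contains (a,b,c) and (b,c,a) but not (a,b,x), and `comparisons` is 9 for three elements over three positions.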