Find longest matching ngrams in MySQL

Question

Given a column containing ngrams in a VARCHAR with utf8mb4_unicode_ci collation:

+---------------------------+
| ngram                     |
+---------------------------+
| stack overflow            |
| stack                     |
| overflow                  |
| stack overflow protection |
| overflow protection       |
| protection                |
+---------------------------+
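To reproduce the example, the table above could be set up roughly as follows. This is only a sketch: the question gives the column type as VARCHAR with utf8mb4_unicode_ci, so the length of 100 is an assumption.

```sql
-- Assumed schema: only the type family and collation come from the question;
-- the VARCHAR length (100) is a guess.
CREATE TABLE ngrams (
  ngram VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL
);

-- Sample rows exactly as listed in the question.
INSERT INTO ngrams (ngram) VALUES
  ('stack overflow'),
  ('stack'),
  ('overflow'),
  ('stack overflow protection'),
  ('overflow protection'),
  ('protection');
```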

And a query:

SELECT * FROM ngrams WHERE ngram IN ('stack', 'stack overflow', 'protection', 'overflow')

Given the rows returned by this query, how can I keep only the rows with the longest ngrams from the returned rows?

In this example, I get 4 rows: stack, stack overflow, protection, and overflow.

Then, I need to filter rows like this:

I filter out stack, because stack overflow exists in the returned rows
I keep stack overflow, because no other returned row is a ngram containing stack overflow (there is stack overflow protection in the table, but it's not in the returned rows)
I keep protection too
I filter out overflow, because stack overflow exists in the returned rows

It must be done in MySQL because of collations (comparisons outside of MySQL wouldn't give the same results as in MySQL), unless there is some MySQL function I'm not aware of that exposes the collated version of a string.
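On the parenthetical above: MySQL versions that provide WEIGHT_STRING() can return the collation weights of a string, which is the closest thing to a "collated version". The manual describes it as a debugging function intended for internal use, so the following is a hedged sketch rather than something to build on. The literals are illustrative only.

```sql
-- Compare the collation weights of two spellings under utf8mb4_unicode_ci.
-- Strings that compare equal under a collation are expected to yield
-- identical weight strings.
SELECT
  HEX(WEIGHT_STRING(_utf8mb4'Stack' COLLATE utf8mb4_unicode_ci)) AS weight_1,
  HEX(WEIGHT_STRING(_utf8mb4'stack' COLLATE utf8mb4_unicode_ci)) AS weight_2,
  _utf8mb4'Stack' COLLATE utf8mb4_unicode_ci
    = _utf8mb4'stack' COLLATE utf8mb4_unicode_ci               AS equal_in_mysql;
```

Comparing such weights outside MySQL would still tie the application to MySQL's weight format, which may change between versions.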

I can think of the following solution: (sql fiddle)

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection')
AND     NOT EXISTS (
    SELECT  1
    FROM    ngrams n2
    WHERE   n2.ngram IN ('stack', 'stack overflow', 'protection')
    AND     LENGTH(n2.ngram) > LENGTH(n1.ngram)
    AND     CONCAT(' ', n2.ngram, ' ') LIKE CONCAT('% ', n1.ngram, ' %')
)

It's inefficient, though, since the sub-query will be executed for every matched ngram.
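The space padding in the LIKE condition of the query above is what restricts matches to whole words. A small illustration with literal strings (the values are made up for the demonstration):

```sql
SELECT
  -- Unpadded: 'foo' matches inside the word 'foobar'.
  'foobar' LIKE CONCAT('%', 'foo', '%')                               AS unpadded_match,     -- 1
  -- Padded: ' foo ' does not occur in ' foobar '.
  CONCAT(' ', 'foobar', ' ') LIKE CONCAT('% ', 'foo', ' %')           AS padded_partial,     -- 0
  -- Padded: ' stack ' does occur in ' stack overflow '.
  CONCAT(' ', 'stack overflow', ' ') LIKE CONCAT('% ', 'stack', ' %') AS padded_whole_word;  -- 1
```

Note that LIKE treats % and _ inside the ngram itself as wildcards, so ngrams containing those characters would need escaping.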

So I'm searching for

either a way to make this query efficient
or a way to do this reliably outside of MySQL (taking collations into account)

What exactly do you want the query to return? It isn't clear and people are providing multiple solutions that aren't equivalent. — plalx, May 08 '14 at 13:29
As fancy as the answers try to be, `NOT EXISTS` probably [outperforms them all](http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/) since there are no ways to perform the operation in a single SELECT anyway. I believe that with a CTE it could possibly have been faster because you could use recursion, but MySQL doesn't seem to support anything like that. — plalx, May 09 '14 at 10:57
However, instead of checking `LENGTH(n2.ngram) > LENGTH(n1.ngram)` you can just check `n2.ngram <> n1.ngram` and I am not sure why you do `CONCAT(' ', n2.ngram, ' ')`? Do you need the spaces in the `LIKE`? — plalx, May 09 '14 at 11:04
@plalx some of the answers seem to have queries that are more efficient than the one in the question. For the CONCAT, it's to avoid matching part of words, e.g. `%foo%` would match `foobar`, whereas `% foo %` wouldn't. — Arnaud Le Blanc, May 09 '14 at 16:42
How often do you need to do this operation? This is an example of a directed graph problem, which you can probably solve by pre-processing the ngram table itself. Also, how large is the ngram table and how long is the `in` list? — Gordon Linoff, May 10 '14 at 03:49
@GordonLinoff the table has a few million entries, and the `in` may have up to 20 elements (usually less). — Arnaud Le Blanc, May 10 '14 at 18:55
@arnaud576875 . . . Depending on the number of relationships and the number of such queries, I think the approach in my solution using the pairs table is going to provide the best solution. Of course, this also depends on whether the ngrams table is being frequently modified, and whether the modifications are only inserts or include updates and deletes. — Gordon Linoff, May 10 '14 at 19:53
I don't think you have the proper data structure here. I know you mentioned you want this in MySQL, but I really think you may be better served using a different data structure (perhaps a trie) that is more suited to this sort of operation. — Mike Brant, May 12 '14 at 21:41

fthiella · Accepted Answer · 2014-05-09T16:19:38.883

If I understand your logic correctly, this query should give you the correct result:

SELECT n1.ngram
FROM
  ngrams n1 LEFT JOIN ngrams n2
  ON
    n2.ngram IN ('stack', 'stack overflow', 'protection')
    AND n2.ngram LIKE CONCAT('%', n1.ngram, '%')
    AND CHAR_LENGTH(n1.ngram) < CHAR_LENGTH(n2.ngram)
WHERE
  n1.ngram IN ('stack', 'stack overflow', 'protection')
  AND n2.ngram IS NULL;

Please see fiddle here. But since I expect that your table could have a lot of records, while your list of words is certainly much more limited, why not remove the shorter ngrams from this list before executing the actual query? My idea is to reduce the list

('stack', 'stack overflow', 'protection')

to

('stack overflow', 'protection')

and this query should do the trick:

SELECT *
FROM
  ngrams
WHERE
  ngram IN (
    SELECT s1.ngram
    FROM (
      SELECT DISTINCT ngram
      FROM ngrams
      WHERE ngram IN ('stack','stack overflow','protection')
    ) s1 LEFT JOIN (
      SELECT DISTINCT ngram
      FROM ngrams
      WHERE ngram IN ('stack','stack overflow','protection')
    ) s2
      ON s2.ngram LIKE CONCAT('%', s1.ngram, '%')
         AND CHAR_LENGTH(s1.ngram) < CHAR_LENGTH(s2.ngram)
    WHERE
      s2.ngram IS NULL
  );

Yes, I'm querying the ngrams table twice before joining the result back to ngrams again, because we have to make sure that the longest value actually exists in the table. But if you have a proper index on the ngram column, the two derived queries that use DISTINCT should be very efficient:

ALTER TABLE ngrams ADD INDEX idx_ngram (ngram);

Fiddle is here.

Edit:

As samuil correctly noted, if you just need to find the longest ngrams and not the whole rows associated with them, then you don't need the outer query, and you can just execute the inner query. With the proper index, the two SELECT DISTINCT queries will be very efficient, and even if the JOIN cannot be optimized (n2.ngram LIKE CONCAT('%', n1.ngram, '%') can't take advantage of an index), it will be executed only on a few already filtered records and should be quite fast.
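For reference, the inner query mentioned in the edit above, run on its own, would look like this. It is simply the sub-select from the earlier query with nothing new added:

```sql
-- Returns only the longest listed ngrams that exist in the table,
-- without joining back to ngrams for the full rows.
SELECT s1.ngram
FROM (
  SELECT DISTINCT ngram
  FROM ngrams
  WHERE ngram IN ('stack', 'stack overflow', 'protection')
) s1
LEFT JOIN (
  SELECT DISTINCT ngram
  FROM ngrams
  WHERE ngram IN ('stack', 'stack overflow', 'protection')
) s2
  ON s2.ngram LIKE CONCAT('%', s1.ngram, '%')
 AND CHAR_LENGTH(s1.ngram) < CHAR_LENGTH(s2.ngram)
WHERE s2.ngram IS NULL;
```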

@fthiella is there a need for three references? As far as I understand, in the outer select you are selecting all ngrams that match the ngrams returned by the subquery. Why can't you simply use this subquery as the whole query? — samuil, May 09 '14 at 07:16

Ruud Helderman · Answer 2 · 2014-05-14T08:12:02.783

You are trying to filter the ngrams in the query itself. It is probably more efficient to do it in two steps. Start with a table with all possible ngrams:

CREATE TABLE original (ngram varchar(100) NOT NULL);

CREATE TABLE refined (ngram varchar(100) NOT NULL PRIMARY KEY);

INSERT INTO original (ngram)
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack', 'stack overflow', 'protection');

INSERT INTO refined (ngram)
SELECT ngram
FROM original;

Then delete the ones you do not want. For each ngram, generate all possible substrings. For each substring, delete that entry (if any) from the list. It takes a couple of nested loops, but unless your ngrams contain an extremely large number of words, it should not take much time.

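-- DELIMITER is a mysql client command; other clients may not need it.
DELIMITER //

CREATE PROCEDURE refine()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE words varchar(100);
    DECLARE posFrom, posTo int;
    DECLARE cur CURSOR FOR SELECT ngram FROM original;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    OPEN cur;

    read_loop: LOOP
        FETCH cur INTO words;
        IF done THEN
            LEAVE read_loop;
        END IF;

        -- posFrom is the start position of the current word in the ngram.
        SET posFrom = 1;
        REPEAT
            -- Delete every sub-ngram that starts at posFrom and ends before a later space.
            SET posTo = LOCATE(' ', words, posFrom);
            WHILE posTo > 0 DO
                DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom, posTo - posFrom);
                SET posTo = LOCATE(' ', words, posTo + 1);
            END WHILE;
            -- Delete the suffix starting at posFrom, unless it is the whole ngram.
            IF posFrom > 1 THEN
                DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom);
            END IF;
            -- Move to the next word; LOCATE returns 0 when no space is left, so posFrom becomes 1.
            SET posFrom = LOCATE(' ', words, posFrom) + 1;
        UNTIL posFrom = 1 END REPEAT;
    END LOOP;

    CLOSE cur;
END //

DELIMITER ;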

What's left, is a table with only the longest ngrams:

CALL refine;

SELECT ngram FROM refined;

SQL Fiddle: http://sqlfiddle.com/#!2/029dc/1/1

EDIT: I added an index on table refined; now it should run in O(n) time.

Why would this more complicated implementation be more efficient compared with a single query statement? — zinking, May 14 '14 at 07:47
@zinking: Rule of thumb: do _not_ use cursors. But in some cases, cursors are a necessary evil. All the declarative approaches I have seen so far seem to run in _O(n*n)_ time. A cursor-based approach should be able to run in _O(n)_ time (provided table `refined` is indexed; see my edit). With a big amount of records, expect a huge performance improvement. — Ruud Helderman, May 14 '14 at 09:05

score 5 · Answer 3 · answered May 12 '14 at 21:36

After doing this without first looking at the other solutions, I see that it's similar to your existing best solution, but slightly simpler to read and possibly a bit more efficient:

SELECT n1.ngram
FROM ngrams n1
LEFT JOIN ngrams n2
  ON n2.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
 AND n1.ngram <> n2.ngram
 AND INSTR(n2.ngram, n1.ngram) > 0
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
 AND n2.ngram IS NULL;

An SQLfiddle to test with.

Since there is no calculation on the AND n1.ngram <> n2.ngram line, the query should be able to use indexes a bit more efficiently.
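INSTR, like an unpadded LIKE, also matches inside words (for example 'stack' inside 'stacks'). If whole-word matching is required, as the asker explained for the CONCAT padding, one possible variant of the query above pads both sides with spaces. This is a sketch of that idea, not part of the original answer:

```sql
SELECT n1.ngram
FROM ngrams n1
LEFT JOIN ngrams n2
  ON n2.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
 AND n1.ngram <> n2.ngram
 -- Pad both strings so that only whole-word occurrences count as a match.
 AND INSTR(CONCAT(' ', n2.ngram, ' '), CONCAT(' ', n1.ngram, ' ')) > 0
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
  AND n2.ngram IS NULL;
```

For the sample table in the question this should keep stack overflow and protection. Unlike LIKE, INSTR does not interpret % or _ in the ngram as wildcards.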

samuil · Answer 4 · score 4 · 2014-05-08T11:51:51.580

I think you can use self inner join on LIKE %original string% and choose only those rows that have ngram length equal to the longest joined ngram length.

SELECT n1.* FROM ngrams n1
  INNER JOIN ngrams n2 ON
    n2.ngram LIKE CONCAT('%', `n1`.`ngram`, '%')
    AND n2.ngram IN ('stack overflow', 'stack')
  WHERE n1.ngram IN ('stack overflow', 'stack')
  GROUP BY n1.ngram
  HAVING MAX(CHAR_LENGTH(n2.ngram)) = CHAR_LENGTH(n1.ngram);

The downside of this solution is that you need to provide your string list twice.

It turns out that you don't need to provide the list twice:

SELECT n1.*
  FROM ngrams n1
  INNER JOIN ngrams n2 ON
    n2.ngram LIKE CONCAT('%', `n1`.`ngram`, '%')
    AND n2.ngram IN ('stack overflow', 'stack')
  GROUP BY n1.ngram
  HAVING MAX(CHAR_LENGTH(n2.ngram)) = CHAR_LENGTH(n1.ngram);

edited May 08 '14 at 11:51

answered May 08 '14 at 11:44

samuil


Doesn't work with `AND n2.ngram IN ('stack', 'stack overflow', 'protection')` – Subdigger May 08 '14 at 11:56
Strange, I have checked it with `'protection'` as well. What problem have you encountered with `'protection'` added? – samuil May 08 '14 at 11:58
result should be `stack overflow protection` but it is `stack overflow` – Subdigger May 08 '14 at 12:00
No. As far as I understood original problem you should get `stack overflow protection` string only if it was explicitly listed. – samuil May 08 '14 at 12:02
agreed.. for just 'protection', it should be `stack overflow` as there is no string as `protection` – G one May 08 '14 at 12:03
Working, but I would love to see an efficient solution (with a large table and up to 20 ngrams). – Arnaud Le Blanc May 08 '14 at 20:45
I believe that first variant (which provides string list twice) should be efficient if proper index is defined. It doesn't differ much in terms of performance from fthiella solution. – samuil May 09 '14 at 07:15

score 4 · Answer 5 · answered May 10 '14 at 04:04

This slight modification to your query:

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection') AND
        NOT EXISTS (SELECT  1
                    FROM    ngrams n2
                    WHERE   n2.ngram IN ('stack', 'stack overflow', 'protection') AND
                            n2.ngram <> n1.ngram AND
                            n2.ngram LIKE CONCAT('%', n1.ngram, '%')
                   );

This should be pretty fast with an index on ngrams(ngram). Note that this simplifies the LIKE condition. I see no reason why you should be worried about word boundaries. Wouldn't "stacks" be a longer version of "stack"? (Although the items referred to by n-grams can be words, I associate them with letters unless otherwise noted.)

With the index, this should be equivalent in performance to other solutions using join.

If I had to do this zillions of times and the ngram table were not too big, I would preprocess it to get all pairs of "generalizations" -- ngram_pairs. This changes the above to

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection') AND
        NOT EXISTS (SELECT  1
                    FROM    ngram_pairs np
                    WHERE   np.ngram1 = n1.ngram and
                            np.ngram2 in ('stack', 'stack overflow', 'protection') 
                   )

This should perform much better than the like with an index on ngram_pairs(ngram1, ngram2). The following is the code for generating ngram_pairs:

create table ngram_pairs as
    select n1.ngram as ngram1, n2.ngram as ngram2
    from ngrams n1 join
         ngrams n2
         on length(n1.ngram) < length(n2.ngram) and
            n2.ngram like concat('%', n1.ngram, '%');

create index ngram_pairs_ngram1_ngram2 on ngram_pairs(ngram1, ngram2);

score 3 · Answer 6 · answered May 11 '14 at 10:48

Try this query, which uses user variables:

select 
  ngram
from 
  (select 
    ngram, 
    @t:=if(@prev=rank, @t+1, 1) as num,
    @prev:=rank
  from 
    (select 
      ngram,
      @rank:=if(@prev like concat(ngram,'%'), @rank, @rank+1) as rank,
      CHAR_LENGTH(ngram) as size,
      @prev:=ngram
    from 
      tbl 
    join 
      (select 
         @prev:='', 
         @rank:=1) t 
    where 
       ngram in ('stack overflow', 'stack', 'protection')
    order by 
       rank, size desc
   )t
  join 
    (select 
       @t:=0, 
       @prev:=0) t1
    ) t 
  where 
    num =1

Fiddle

|          NGRAM |
|----------------|
| stack overflow |
|     protection |

Dipin · Answer 7 · 2014-05-14T03:09:47.583

The following query only scans the data once and provides the correct results (fiddle):

SELECT my_ngrams.ngram
  FROM (SELECT CASE WHEN @v LIKE CONCAT('%',n1.ngram,'%') THEN 1 ELSE 0 END AS ngram_match
             , @v:=concat(@v,',',n1.ngram) AS ngram_concat
             , n1.ngram
          FROM    ngrams n1, (SELECT @v := '') r
         WHERE   n1.ngram IN ('stack', 'stack overflow', 'overflow', 'protection', 'overflow protection')
      ORDER BY length(n1.ngram) DESC) my_ngrams
 WHERE my_ngrams.ngram_match <> 1
;

However, it relies on the behavior of user-defined variables in MySQL (http://dev.mysql.com/doc/refman/5.5/en/user-variables.html) and should be used with some caution as a result.

The "order by" is important to the solution as that impacts how the user-defined variable is evaluated on a row-by-row basis which impacts which rows get matched by the case and later filtered.

It also concatenates all results together to search through for ngram matches before filtering so you should be aware that you could end up with a concatenated string that is wider than the maximum allowed by MySQL (http://dev.mysql.com/doc/refman/5.5/en/char.html).

This should be very efficient even for large tables as long as the column is indexed properly.

Looks great, however it works only if two ngrams share the same prefix. E.g. with `IN('stack overflow', 'overflow')`, I should get only `stack overflow`, but I get `overflow` too: http://www.sqlfiddle.com/#!2/f8be79/78 — Arnaud Le Blanc, May 12 '14 at 13:25
Updated to handle that case. But what about if there's also a word that is a substring of another word? E.g. with `IN('stack', 'stack overflow', 'stac')`? What should the result be then? Right now, `stac` gets removed as being a duplicate. See http://www.sqlfiddle.com/#!2/86a21/2 — Dipin, May 12 '14 at 23:51

Joseph B · Answer 8 · score 2 · 2014-05-09T01:16:47.730

Here is an alternative using a LEFT JOIN.

The table is left-joined to the list of requested ngrams on the condition that a requested ngram contains the table's ngram and is not equal to it. Rows for which no such containing ngram exists (n2.ngram IS NULL) are kept. Correlated sub-queries have been avoided, keeping performance in mind.

EDIT:

Added filter conditions.

SELECT n1.ngram
FROM ngrams n1
LEFT JOIN 
(
  SELECT ngram
  FROM ngrams
  WHERE ngram IN ('stack', 'stack overflow', 'protection')) n2
ON n2.ngram like Concat('%', n1.ngram, '%') and n1.ngram <> n2.ngram
WHERE n2.ngram IS NULL
AND n1.ngram IN ('stack', 'stack overflow', 'protection');

If you are checking to see if only the start of the ngram is contained in another ngram, you can replace the JOIN condition with ON n2.ngram like Concat(n1.ngram, '%') and n1.ngram <> n2.ngram.
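Written out in full, the prefix-only variant described above would be the following. It is the same query as before with only the JOIN condition swapped:

```sql
SELECT n1.ngram
FROM ngrams n1
LEFT JOIN (
  SELECT ngram
  FROM ngrams
  WHERE ngram IN ('stack', 'stack overflow', 'protection')
) n2
  -- Match only when n1.ngram is a prefix of another listed ngram.
  ON n2.ngram LIKE CONCAT(n1.ngram, '%')
 AND n1.ngram <> n2.ngram
WHERE n2.ngram IS NULL
  AND n1.ngram IN ('stack', 'stack overflow', 'protection');
```

With this condition an ngram such as 'overflow' would not be filtered out by 'stack overflow', since it is a suffix rather than a prefix.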

I added more values in the SQL Fiddle:

'xyz' (which is not contained in any other ngram)
'stack overflow exception' (which is another parent of 'stack overflow')
'stack overflow exception handling' (which is the parent of 'stack overflow exception')

SQL Fiddle demo

Reference:

JOIN syntax on MySQL Reference Manual

edited May 09 '14 at 01:16

answered May 08 '14 at 13:28

Joseph B


What is point of using `LEFT JOIN` when you are checking if it `IS NULL` in `WHERE` clause? Is there any difference? – samuil May 08 '14 at 13:32
@samuil Yes, there is. The LEFT JOIN gets the rows that satisfy the JOIN condition plus all the rows from n1. We want only rows that do not satisfy the JOIN condition (comparable to rows that do not exist / not in a subquery with the join condition). Hence, we check if n2 value is NULL. – Joseph B May 08 '14 at 13:46
Here's a blog that discusses the various approaches in terms of performance: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/ – Joseph B May 08 '14 at 13:46
Sorry -- my question missed what I was comparing `LEFT JOIN` to. I meant that in this case `INNER JOIN` should be equivalent. – samuil May 08 '14 at 13:50
@samuil No worries. An INNER JOIN would get you the rows that DO satisfy the condition (that there is another ngram which contains the current ngram). On the other hand, we want ngrams that do NOT satisfy the condition. Hence, this is an anti-join. – Joseph B May 08 '14 at 13:55
@samuil You can play with the SQL Fiddle to see the differences between the different joins too :-) – Joseph B May 08 '14 at 14:19
Thanks. This doesn't seem to solve the problem, though. The expected result for http://www.sqlfiddle.com/#!2/f8be79/28 should be 'stack overflow', 'protection'. – Arnaud Le Blanc May 08 '14 at 20:02
@arnaud576875 Thank you for feedback. I have updated the query to include join conditions. This query should perform quite optimally. fthiella's index suggestion is a good one. I tried my query with fthiella's EXPLAIN PLAN. The # of steps required is less without requiring subqueries. – Joseph B May 09 '14 at 01:21

chetan · Answer 9 · 2014-05-09T04:40:41.097

Try this one: Fiddle

SELECT * 
FROM   tab 
WHERE  ngram NOT IN (SELECT DISTINCT b.ngram 
                     FROM   tab a, 
                            tab b 
                     WHERE  a.ngram != b.ngram 
                            AND a.ngram LIKE Concat('%', b.ngram, '%'));

If you want to include only those ngrams on the list that exist in the table, then try this query:

SELECT b.ngram ab 
FROM   (SELECT * 
        FROM   tab 
        WHERE  ngram IN ( 'stack', 'stack overflow', 'protection' )) a, 
       (SELECT * 
        FROM   tab 
        WHERE  ngram IN ( 'stack', 'stack overflow', 'protection' )) b 
WHERE  a.ngram LIKE Concat('%', b.ngram, '%') 
GROUP  BY b.ngram 
HAVING Count(*) = 1

Demo2

It doesn't seem to accept list of ngrams as an argument. — samuil, May 08 '14 at 14:04

score 0 · Answer 10 · answered May 08 '14 at 13:52

SELECT * FROM   ngrams a WHERE  a.n NOT IN (SELECT DISTINCT a.n 
                 FROM   ngrams b
                 WHERE b.n != a.n 
                    AND b.n LIKE CONCAT('%', a.n, '%'));

answered May 08 '14 at 13:52

sanjaykumar


score 0 · Answer 11 · answered May 08 '14 at 13:54

SELECT  a.ngram FROM ngram a  CROSS JOIN (SELECT ngram AS ngram1 FROM ngram) b 
ON b.ngram1 LIKE CONCAT('%', a.ngram, '%') 
WHERE length(a.ngram) <= length(b.ngram1) 
GROUP BY a.ngram HAVING COUNT(a.ngram) = 1 ORDER BY LENGTH(b.ngram1) DESC

answered May 08 '14 at 13:54

Dmitry Seleznev


echo_Me · Answer 12 · score -3 · 2014-05-08T18:29:42.410

Try

 ORDER BY LENGTH(ngram) DESC and use LIMIT 1

EDIT:

try that :

  SELECT n1.ngram
  FROM ngrams n1 
  INNER JOIN ngrams n2
  ON LENGTH(n2.ngram) < LENGTH(n1.ngram)
  WHERE   n2.ngram IN ('stack', 'stack overflow', 'protection')
  GROUP BY n1.ngram

edited May 08 '14 at 18:29

answered May 05 '14 at 19:01

echo_Me


Not answering the question – Arnaud Le Blanc May 05 '14 at 20:01
