Given a column containing ngrams in a VARCHAR
with utf8mb4_unicode_ci
collation:
+---------------------------+
| ngram |
+---------------------------+
| stack overflow |
| stack |
| overflow |
| stack overflow protection |
| overflow protection |
| protection |
+---------------------------+
And a query:
SELECT * FROM ngrams WHERE ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
Given the rows returned by this query, how can I keep only the rows with the longest ngrams from the returned rows?
In this example, I get 3 rows: stack
, stack overflow
, and protection
.
Then, I need to filter rows like this:
- I filter out
stack
, becausestack overflow
exists in the returned rows - I keep
stack overflow
, because no other returned row is a ngram containingstack overflow
(there isstack overflow protection
in the table, but it's not in the returned rows) - I keep
protection
too - I filter out
overflow
, becausestack overflow
exists in the returned rows
It must be done in MySQL because of collations (comparisons outside of MySQL wouldn't give the same results than in MySQL). (Unless I'm not aware of some MySQL function allowing to expose the collated version of a string.)
I can think of the following solution: (sql fiddle)
SELECT ngram
FROM ngrams n1
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection')
AND NOT EXISTS (
SELECT 1
FROM ngrams n2
WHERE n2.ngram IN ('stack', 'stack overflow', 'protection')
AND LENGTH(n2.ngram) > LENGTH(n1.ngram)
AND CONCAT(' ', n2.ngram, ' ') LIKE CONCAT('% ', n1.ngram, ' %')
)
It's inefficient, though, since the sub-query will be executed for every matched ngram.
So I'm searching for
- either a way to make this query efficient
- or a way to do this reliably outside of MySQL (taking collations into account)