Databases: Effectively implement string contains query

Question

I need a way to effectively do a string contains query like:

# In SQL
LIKE '%some-string%'

# In mongo
{ $regex: /some-string/ }

But its very slow when the dataset size is big. Eg. I tried in a dummy DB (with and without an index - no index is surprisingly faster on mongo) and generate 100m rows (in reality theres more). Seems reasonable if I use ElasticSearch, but I am wondering if theres a DB or way I can structure my data to optimise this use case? I asked and I really need contains instead of a prefix match ...

This is like 3 questions in one. Please focus the question to a specific database and technology. — The Impaler, Feb 18 '23 at 13:13

score 1 · Accepted Answer · answered Feb 18 '23 at 17:12

Postgresql offers so-called trigram indexes. Those indexes can accelerate SQL col LIKE '%search%' predicates efficiently enough. Notice that indexing can, in all makes of server, speed up col LIKE 'string%' (without the leading wildcard character).

MySQL / Mariadb have FULLTEXT indexes that work with a distinctive SQL syntax. That feature works word-by-word unlike, well, LIKE which works character-by-character. Microsoft SQL Server has a similar feature with different syntax. It also works word-by-word.

So, there's no SQL standard way to do this efficiently, and different makes of server do it differently.

If you haven't yet chosen a particular make of server, you should figure out whether one of the full text schemes will serve your purpose. If you must get good performance from LIKE, postgresql's trigram indexing is the way to go.

Wow awesome! Let me investigate PG way! I tried MySQL LIKE and FULLTEXT alr. LIKE fits the usecase best (as it can do substring / non word match) however it can be slow too on a large dataset — Jiew Meng, Feb 18 '23 at 23:27

score 0 · Answer 2 · answered Feb 19 '23 at 22:36

There's no general solution to this that works for all database systems i think. As another answer already explains, there are fulltext search extensions to a lot of popular database systems that, while they're far from being able to do what stuff like Lucene/ElasticSearch can do, should be enough to massively speed up your use case.

Let me explain this from a database internals perspective. Let's say that your selectivity is high a.k.a only a very small percentage of your tuples actually match your condition then you would generally want to have some kind of index structure. The kind of index structure you would need for this kind of query is some kind of Radix-Tree/Trie but that's not a standard data structure implemented in all SQL databases. The only data structure that is actually implemented in almost all SQL databases is a B-Tree. But a B-Tree can only do Prefix queries something like LIKE 'test%'. The only chance you have for LIKE '%test%' if your database doesn't have such indexes is having a very fast runtime system which none of the traditional (open source) database systems has...

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). — Community, Feb 24 '23 at 01:00

Databases: Effectively implement string contains query

2 Answers2