How to use REGEXP_SUBSTR for searches between two tables with Snowflake?

Question

I am currently using the following code to be able to do an intelligent text search between two tables, none of the tables have any relationship that joins them.

The goal of my SQL is to be able to find the text of table_A that is in table_b, regardless of whether it is accompanied by special characters.

With the following code using the REGEXP_SUBSTR instruction, I am getting two problems:

1.) My SQL performance decreases exponentially when we talk about many records to match (Is there another better way?
2.) when the text has a special character it doesn't work for example the '.'

Thank you

--Create test tables
CREATE OR REPLACE TEMPORARY TABLE TABLE_A
AS 
SELECT  'heLLO'    AS CHAINE
       ,'ENGLISH' AS  TYPE
UNION
SELECT  'HI'     AS CHAINE
       ,'ENGLISH' AS  TYPE
UNION
SELECT  'bONJOUR' AS CHAINE
       ,'FRENCH' AS  TYPE
UNION
SELECT  'hOLa' AS CHAINE
       ,'SPANISH' AS  TYPE
       
;
       



CREATE OR REPLACE TEMPORARY TABLE TABLE_B
AS 
SELECT  'HELLO *'    AS CHAINE     
UNION
SELECT  'HI.'     AS CHAINE     
UNION
SELECT  'BONJOUR -' AS CHAINE     
UNION
SELECT  'hOLa' AS CHAINE
       
;

Here the query that makes the match between the two tables


SELECT TABLE_A.* ,TABLE_B.*
FROM TABLE_A
INNER JOIN TABLE_B ON
     ( 
       TABLE_A.TYPE ='ENGLISH'
       AND REGEXP_SUBSTR (TABLE_A.CHAINE 
                         ,'.*\\b' ||    REPLACE(TABLE_B.CHAINE,'.','.\\')  || '\\b.*'  
                                                                          ,1
                                                       ,1
                                                       ,'i') IS NOT NULL

)

The current result is good, the word 'heLLO' was found no matter if it was case sensitive, however the word 'HI' was not found as it had a dot

If its just case insensitive substr search, cant you just use - where position(lower(t1.chaine), lower(t2.chaine))>0 ? — Pankaj, Jun 14 '22 at 17:34
It is a good option, but I find problems when, for example, the word is part of another, for example : Search word : Hello, Word in table b : HelloWorld . It will bring me HelloWorld because Hello is inside , and really I only need the matches of Hello — Ser, Jun 14 '22 at 19:40
In your code, try changing "replace" to - REPLACE(TABLE_B.CHAINE,'.',''), else use regexp_replace with correct format strings — Pankaj, Jun 14 '22 at 20:12
Okay. perfect :) , i think it will work like this with REPLACE(TABLE_B.CHAINE,'.','') but i still have the performance problem, with the mentioned code, do you have any clue how i could improve the performance? I suspect that because there is an inner join linked to the two tables, it becomes a cartesian join. and the query will be much slower as the tables grow — Ser, Jun 14 '22 at 20:21

score 0 · Answer 1 · answered Jun 20 '22 at 03:47

It's slow because you are breaking the golden rule "never use function in your WHERE clause", given an ON is just a special WHERE it still hold. If you want good performance in any database, you want to use EQUI Joins (aka a = b).

If we looking at what this toy example is doing you can see you join is really a CROSS JOIN with some expensive WHERE clauses:

with TABLE_A(chaine, type) as (
    select * from values
        ('heLLO','ENGLISH'),
        ('HI','ENGLISH'),
        ('bONJOUR','FRENCH'),
        ('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
    select * from values
        ('HELLO *'), 
        ('HI.'),   
        ('BONJOUR -'),
        ('hOLa')
), pre_tb as (
    select 
        *
        ,REPLACE(CHAINE,'.','') as rep
        ,'.*\\b' || rep || '\\b.*' as pata
        ,'\\b' || rep || '\\b' as patb
    from table_b
)
SELECT ta.*
    ,tb.*
    ,REGEXP_SUBSTr (ta.CHAINE , tb.pata ,1 ,1 ,'i') IS NOT NULL as reg_a
    ,REGEXP_SUBSTr (ta.CHAINE , tb.patb ,1 ,1 ,'i') IS NOT NULL as reg_b
FROM TABLE_A as ta
JOIN pre_tb as tb 
    ON ta.TYPE ='ENGLISH'
       --AND REGEXP_SUBSTr (ta.CHAINE , tb.pat ,1 ,1 ,'i') IS NOT NULL

I applied the suggestion to update the REPLACE, but the above gives:

CHAINE	TYPE	CHAINE_2	REP	PATA	PATB	REG_A	REG_B
heLLO	ENGLISH	HELLO *	HELLO *	.\bHELLO \b.*	\bHELLO *\b	TRUE	TRUE
heLLO	ENGLISH	HI.	HI	.\bHI\b.	\bHI\b	FALSE	FALSE
heLLO	ENGLISH	BONJOUR -	BONJOUR -	.\bBONJOUR -\b.	\bBONJOUR -\b	FALSE	FALSE
heLLO	ENGLISH	hOLa	hOLa	.\bhOLa\b.	\bhOLa\b	FALSE	FALSE
HI	ENGLISH	HELLO *	HELLO *	.\bHELLO \b.*	\bHELLO *\b	FALSE	FALSE
HI	ENGLISH	HI.	HI	.\bHI\b.	\bHI\b	TRUE	TRUE
HI	ENGLISH	BONJOUR -	BONJOUR -	.\bBONJOUR -\b.	\bBONJOUR -\b	FALSE	FALSE
HI	ENGLISH	hOLa	hOLa	.\bhOLa\b.	\bhOLa\b	FALSE	FALSE

so what this is showing is you don't need the wild .* on the start/end of the REGEXP_SUBSTR. as REG_A and REG_B are equal results. But If also shows the start of pre-processing your data.

Also of note the HELLO * only works because * is valid regex syntax.. which makes this code rather dangerous. That stuff should really be stripped off, or correctly escaped.

The REGEX_SUBSTR can be replaced with REGEXP_LIKE because you really are only asking is there a match, and not want the result, thus don't ask for more than you want.

Thus I would be inclined to use code like this:

with TABLE_A(chaine, type) as (
    select * from values
        ('heLLO','ENGLISH'),
        ('HI','ENGLISH'),
        ('bONJOUR','FRENCH'),
        ('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
    select * from values
        ('HELLO *'), 
        ('HI.'),   
        ('BONJOUR -'),
        ('hOLa')
), pre_ta as (
    select chaine, lower(chaine) as l_chaine, type
    from table_a
), pre_tb as (
    select 
        *
        ,regexp_substr(lower(CHAINE), '[a-z]+') as rep
        ,'\\b' || rep || '\\b' as pat
    from table_b
)
SELECT ta.*
    ,tb.*
    ,REGEXP_LIKE (ta.CHAINE , tb.pat) as reg
FROM pre_ta as ta
JOIN pre_tb as tb 
    ON ta.TYPE ='ENGLISH'

and if you are just want the first match to work:

with TABLE_A(chaine, type) as (
    select * from values
        ('heLLO','ENGLISH'),
        ('HI','ENGLISH'),
        ('bONJOUR','FRENCH'),
        ('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
    select * from values
        ('HELLO *'), 
        ('HI.'),   
        ('BONJOUR -'),
        ('hOLa')
), pre_ta as (
    select chaine, type,
        regexp_substr(lower(chaine), '[a-z]+') as rep
    from table_a
), pre_tb as (
    select 
        *
        ,regexp_substr(lower(chaine), '[a-z]+') as rep
    from table_b
)
SELECT ta.*
    ,tb.*
    ,ta.rep = tb.rep
FROM pre_ta as ta
JOIN pre_tb as tb 
    ON ta.TYPE ='ENGLISH'

and if multi-match is needed I would use SPLIT_TO_TABLE:

with TABLE_A(chaine, type) as (
    select * from values
        ('heLLO','ENGLISH'),
        ('bob heLLO','ENGLISH'),
        ('HI','ENGLISH'),
        ('bONJOUR','FRENCH'),
        ('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
    select * from values
        ('HELLO *'), 
        ('HI. cat*'),   
        ('BONJOUR -'),
        ('hOLa')
), pre_ta as (
    select t.chaine, t.type,
        regexp_substr(lower(trim(s.value)), '[a-z]+') as rep
    from table_a as t, table(split_to_table(chaine, ' ')) s
    where rep <> ''
), pre_tb as (
    select 
        *
        ,regexp_substr(lower(trim(s.value)), '[a-z]+') as rep
    from table_b as t, table(split_to_table(chaine, ' ')) s
    where rep <> ''
)
SELECT ta.*
    ,tb.*
    ,ta.rep = tb.rep
FROM pre_ta as ta
JOIN pre_tb as tb 
    ON ta.TYPE ='ENGLISH'

which is now using an equi-join, thus putting that back into the ON clause:

with TABLE_A(chaine, type) as (
    select * from values
        ('heLLO','ENGLISH'),
        ('bob heLLO','ENGLISH'),
        ('HI','ENGLISH'),
        ('bONJOUR','FRENCH'),
        ('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
    select * from values
        ('HELLO *'), 
        ('HI. cat*'),   
        ('BONJOUR -'),
        ('hOLa')
), pre_ta as (
    select t.chaine, t.type,
        regexp_substr(lower(trim(s.value)), '[a-z]+') as rep
    from table_a as t, table(split_to_table(chaine, ' ')) s
    where rep <> ''
), pre_tb as (
    select 
        *
        ,regexp_substr(lower(trim(s.value)), '[a-z]+') as rep
    from table_b as t, table(split_to_table(chaine, ' ')) s
    where rep <> ''
)
SELECT ta.chaine,
    ta.type,
    tb.chaine
FROM pre_ta as ta
JOIN pre_tb as tb 
    ON ta.TYPE ='ENGLISH'
        AND ta.rep = tb.rep

gives:

CHAINE	TYPE	CHAINE_2
heLLO	ENGLISH	HELLO *
bob heLLO	ENGLISH	HELLO *
HI	ENGLISH	HI. cat*

How to use REGEXP_SUBSTR for searches between two tables with Snowflake?

1 Answers1