It's slow because you are breaking the golden rule "never use function in your WHERE clause", given an ON is just a special WHERE it still hold. If you want good performance in any database, you want to use EQUI Joins (aka a = b).
If we looking at what this toy example is doing you can see you join is really a CROSS JOIN with some expensive WHERE clauses:
with TABLE_A(chaine, type) as (
select * from values
('heLLO','ENGLISH'),
('HI','ENGLISH'),
('bONJOUR','FRENCH'),
('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
select * from values
('HELLO *'),
('HI.'),
('BONJOUR -'),
('hOLa')
), pre_tb as (
select
*
,REPLACE(CHAINE,'.','') as rep
,'.*\\b' || rep || '\\b.*' as pata
,'\\b' || rep || '\\b' as patb
from table_b
)
SELECT ta.*
,tb.*
,REGEXP_SUBSTr (ta.CHAINE , tb.pata ,1 ,1 ,'i') IS NOT NULL as reg_a
,REGEXP_SUBSTr (ta.CHAINE , tb.patb ,1 ,1 ,'i') IS NOT NULL as reg_b
FROM TABLE_A as ta
JOIN pre_tb as tb
ON ta.TYPE ='ENGLISH'
--AND REGEXP_SUBSTr (ta.CHAINE , tb.pat ,1 ,1 ,'i') IS NOT NULL
I applied the suggestion to update the REPLACE, but the above gives:
CHAINE |
TYPE |
CHAINE_2 |
REP |
PATA |
PATB |
REG_A |
REG_B |
heLLO |
ENGLISH |
HELLO * |
HELLO * |
.*\bHELLO \b. |
\bHELLO *\b |
TRUE |
TRUE |
heLLO |
ENGLISH |
HI. |
HI |
.\bHI\b. |
\bHI\b |
FALSE |
FALSE |
heLLO |
ENGLISH |
BONJOUR - |
BONJOUR - |
.\bBONJOUR -\b. |
\bBONJOUR -\b |
FALSE |
FALSE |
heLLO |
ENGLISH |
hOLa |
hOLa |
.\bhOLa\b. |
\bhOLa\b |
FALSE |
FALSE |
HI |
ENGLISH |
HELLO * |
HELLO * |
.*\bHELLO \b. |
\bHELLO *\b |
FALSE |
FALSE |
HI |
ENGLISH |
HI. |
HI |
.\bHI\b. |
\bHI\b |
TRUE |
TRUE |
HI |
ENGLISH |
BONJOUR - |
BONJOUR - |
.\bBONJOUR -\b. |
\bBONJOUR -\b |
FALSE |
FALSE |
HI |
ENGLISH |
hOLa |
hOLa |
.\bhOLa\b. |
\bhOLa\b |
FALSE |
FALSE |
so what this is showing is you don't need the wild .*
on the start/end of the REGEXP_SUBSTR. as REG_A and REG_B are equal results. But If also shows the start of pre-processing your data.
Also of note the HELLO *
only works because *
is valid regex syntax.. which makes this code rather dangerous. That stuff should really be stripped off, or correctly escaped.
The REGEX_SUBSTR can be replaced with REGEXP_LIKE because you really are only asking is there a match, and not want the result, thus don't ask for more than you want.
Thus I would be inclined to use code like this:
with TABLE_A(chaine, type) as (
select * from values
('heLLO','ENGLISH'),
('HI','ENGLISH'),
('bONJOUR','FRENCH'),
('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
select * from values
('HELLO *'),
('HI.'),
('BONJOUR -'),
('hOLa')
), pre_ta as (
select chaine, lower(chaine) as l_chaine, type
from table_a
), pre_tb as (
select
*
,regexp_substr(lower(CHAINE), '[a-z]+') as rep
,'\\b' || rep || '\\b' as pat
from table_b
)
SELECT ta.*
,tb.*
,REGEXP_LIKE (ta.CHAINE , tb.pat) as reg
FROM pre_ta as ta
JOIN pre_tb as tb
ON ta.TYPE ='ENGLISH'
and if you are just want the first match to work:
with TABLE_A(chaine, type) as (
select * from values
('heLLO','ENGLISH'),
('HI','ENGLISH'),
('bONJOUR','FRENCH'),
('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
select * from values
('HELLO *'),
('HI.'),
('BONJOUR -'),
('hOLa')
), pre_ta as (
select chaine, type,
regexp_substr(lower(chaine), '[a-z]+') as rep
from table_a
), pre_tb as (
select
*
,regexp_substr(lower(chaine), '[a-z]+') as rep
from table_b
)
SELECT ta.*
,tb.*
,ta.rep = tb.rep
FROM pre_ta as ta
JOIN pre_tb as tb
ON ta.TYPE ='ENGLISH'
and if multi-match is needed I would use SPLIT_TO_TABLE:
with TABLE_A(chaine, type) as (
select * from values
('heLLO','ENGLISH'),
('bob heLLO','ENGLISH'),
('HI','ENGLISH'),
('bONJOUR','FRENCH'),
('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
select * from values
('HELLO *'),
('HI. cat*'),
('BONJOUR -'),
('hOLa')
), pre_ta as (
select t.chaine, t.type,
regexp_substr(lower(trim(s.value)), '[a-z]+') as rep
from table_a as t, table(split_to_table(chaine, ' ')) s
where rep <> ''
), pre_tb as (
select
*
,regexp_substr(lower(trim(s.value)), '[a-z]+') as rep
from table_b as t, table(split_to_table(chaine, ' ')) s
where rep <> ''
)
SELECT ta.*
,tb.*
,ta.rep = tb.rep
FROM pre_ta as ta
JOIN pre_tb as tb
ON ta.TYPE ='ENGLISH'
which is now using an equi-join, thus putting that back into the ON clause:
with TABLE_A(chaine, type) as (
select * from values
('heLLO','ENGLISH'),
('bob heLLO','ENGLISH'),
('HI','ENGLISH'),
('bONJOUR','FRENCH'),
('hOLa','SPANISH')
), TABLE_B(CHAINE) as (
select * from values
('HELLO *'),
('HI. cat*'),
('BONJOUR -'),
('hOLa')
), pre_ta as (
select t.chaine, t.type,
regexp_substr(lower(trim(s.value)), '[a-z]+') as rep
from table_a as t, table(split_to_table(chaine, ' ')) s
where rep <> ''
), pre_tb as (
select
*
,regexp_substr(lower(trim(s.value)), '[a-z]+') as rep
from table_b as t, table(split_to_table(chaine, ' ')) s
where rep <> ''
)
SELECT ta.chaine,
ta.type,
tb.chaine
FROM pre_ta as ta
JOIN pre_tb as tb
ON ta.TYPE ='ENGLISH'
AND ta.rep = tb.rep
gives:
CHAINE |
TYPE |
CHAINE_2 |
heLLO |
ENGLISH |
HELLO * |
bob heLLO |
ENGLISH |
HELLO * |
HI |
ENGLISH |
HI. cat* |