1

Could anyone (with extensive experience in regular-expression matching) please clarify for me why the following query returns (what I consider) unexpected results in Oracle 12?

select regexp_substr('My email: test@tes6t.test', '[^@:space:]+@[^@:space:]+') 
from dual;

Expected result: test@tes6t.test

Actual result: t@t

Another example:

select regexp_substr('Beneficiary email: super+test.media.beneficiary1@gmail.com', '[^@:space:]+@[^@:space:]+') 
from dual;

Expected result: super+test.media.beneficiary1@gmail.com

Actual result: ry1@gm

EDIT: I double-checked and this is not related to Oracle SQL, but the same behaviour applies to any regex engine. Even when simplifying the regex to [^:space:]+@[^:space:]+ the results are the same. I am curious to know why it does not match all the non-whitespace characters before and after the @ sign. And why sometimes it matches 1 character, other times 2 or 3 or more characters, but not all.

Sorin Postelnicu
  • 1,271
  • 1
  • 10
  • 15

2 Answers2

3

The POSIX shortcut you are trying to use is incorrect, you need square brackets around it:

SELECT REGEXP_SUBSTR('Beneficiary email: super+test.media.beneficiary1@gmail.com', '[^@[:space:]]+@[^@[:space:]]+') 
FROM dual;

or even simpler, assuming you only want to validate by checking for an '@' and the email address is always at the end of the string, after the last space:

WITH tbl(str) AS (
  SELECT 'My email: test@tes6t.test' FROM dual UNION ALL
  SELECT 'Beneficiary email: super+test.media.beneficiary1@gmail.com' FROM dual
)
SELECT REGEXP_REPLACE(str, '.* (.*@.*)', '\1')
from tbl
;

Note: REGEXP_REPLACE() will return the original string if the match is not found, where REGEXP_SUBSTR() will return NULL. Keep that in mind and handle no match found accordingly. Always expect the unexpected!

Gary_W
  • 9,933
  • 1
  • 22
  • 40
  • Thank you! That was so stupid of me. But in my defense, it was not clear in the documentation that the POSIX classes must be enclosed in their own brackets. – Sorin Postelnicu Jul 14 '20 at 20:12
  • Also thanks for the example with the differences in behaviour between regexp_replace vs regexp_substr. In my case the emails do not always appear at the end of the text, that was only an example. But thanks! – Sorin Postelnicu Jul 14 '20 at 20:13
0

The REGEX is not correct in your SQL code. Try

select regexp_substr('Beneficiary email: super+test.media.beneficiary1@gmail.com', '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b') 
from dual;

select regexp_substr('My email: test@tes6t.test', '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b') 
from dual;

It gives the result that you expected.

MdBasha
  • 423
  • 4
  • 16
  • Thanks, but I am curious why is the initial regex not correct? Assuming I'm not looking for valid emails, but anything which is: (not-space one-or-more-times)@(not-space one-or-more-times). Even if I simplify the regex to [^:space:]+@[^:space:]+ I still get the same result. Why is it not matching all non-whitespace characters? – Sorin Postelnicu Jul 14 '20 at 13:55
  • Answer by @Gary_W suffices. – MdBasha Jul 14 '20 at 14:24
  • Thank you for the example of email regexp. Still, for some strange reason it does not work. Have you tried them? I suppose it's the word boundaries that have a different syntax in Oracle. I have also tried with double-escape \\b but still they don't work. – Sorin Postelnicu Jul 14 '20 at 20:15
  • Indeed, according to the excellent regular-expressions info website: "Word boundaries, as described above, are supported by most regular expression flavors. Notable exceptions are the POSIX and XML Schema flavors, which don’t support word boundaries at all." – Sorin Postelnicu Jul 14 '20 at 20:31
  • It seems that in Oracle this is the correct regexp: (^|\W)[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}($|\W) as seen here: https://stackoverflow.com/questions/7567700/oracle-regexp-like-and-word-boundaries – Sorin Postelnicu Jul 14 '20 at 22:17