REGEXP_SUBSTR syntax not making sense... Examples within:

Question

I am finding out very quickly that the REGEXP_SUBSTR help online is 100% not resonating with me (Oracle SQL). I believe I have a relatively simple use case, but I'm having a hard time tracking how one REG_SUBSTR argument I'm using is giving me expected results, and why one eludes me. Any help on the use case at hand would be appreciated, as well as a plain language approach to how maybe I could better wrap my head around using, what appears to be a very useful SQL tool.

I have a characteristic value field that has 2 pieces of information I need to extract. Each field will have the following 'A#,P#' - where the # could be 1 digit, all the way to 6 or 7 digits long. I need to extract each of these digits following the 'A', and the 'P'

SO - my first REGEXP_SUBSTR gives me expected results:

SELECT REGEXP_SUBSTR(char_field, '[^,]+',2) a_units

char_field  =>   a_units         p_units
A5,P0               5               0
A15,P1              15              1
A6,P12              6              12
A0,P5               0               5

That seems like it does the trick with extracting the entirety of the values following the 'A' in the field. But, the logic doesn't track for me as to why this works - is the '^' only grabbing information to the left of the comma, and after the 2nd character (',2'). Why is the (+) sign there? Could I use the (-) sign to garner anything different? Could someone please explain to me WHY this works, and how I might do the same thing - extract the number(s) after the 'P' in the field? Thanks in advance!

In addition to not understanding why the logic of the a_units REGEXP_SUBSTR formula works for what I need, I'm stuck with how to do the same with p_units. I have tried multiple iterations of changing the order of the [,^] or the number following the comma. Oracle documentation doesn't seem to have examples of this use case available, and the existing documentation throws me for a loop when trying to learn it.

if you don't store data that way and instead normalize your datastructure, you will have less problems. i believe that you only must read the manual to understand the deterministic logic behind it — nbk, Mar 09 '23 at 00:17
`^` in `[^]` means everything except `` (in your case, everything except the comma between `A` and `P`). You cannot move it `^` in that expression. The `+` applies to what is directly left to it and means "at least 1 time". I suggest you visit [regex101](https://regex101.com/); you can test a regexp against test strings (middle of the webpage), it provides explanations (right pane) and a quick reference (bottom right) to learn more. With all of that said, I agree with @nbk. You could have at least split your field in 2 and used `substr`. — Atmo, Mar 09 '23 at 00:32

Junjie · Answer 1 · 2023-03-09T14:37:57.513

1

The regex [^,]+ in your example mean matching any non-',' character for length 1+.

REGEXP_SUBSTR(char_field, '[^,]+',2) means matching char_filed on regex [^,]+ starting from the 2nd character.

This won't work for 2nd part of matching your need.

Here is the regex match which meets your need (regex101 verification):

\d+

It means matching a list of digit with length 1+

Here is the final query you want:

SELECT
    char_field, 
    REGEXP_SUBSTR(char_field, '\d+',1,1) AS a_units, 
    REGEXP_SUBSTR(char_field, '\d+',1,2) AS p_units
FROM
    input_table

char_field	a_units	p_units
A5,P0	5	0
A15,P1	15	1
A6,P12	6	12
A0,P5	0	5

edited Mar 09 '23 at 14:37

answered Mar 09 '23 at 00:38

Junjie

1,170
1
2
6

Alas, Oracle's implementation of regex does not allow for positive lookbehinds. – Gary_W Mar 09 '23 at 14:13
Thanks, @Gary_W. I now update the answer to use regex without lookbehinds. – Junjie Mar 09 '23 at 14:30

score 1 · Answer 2 · answered Mar 09 '23 at 19:34

While jonathan200's answer is correct based on your example data, it depends on the data in char_field being previously validated to be in the correct format as if the 'A' element does not have a numeric portion, B's number would be returned as A's number thus giving incorrect results. So, depending on how this data was previously entered and validated you may or not have an issue. However, if this data came from a user input on a form or from unscrubbed input then I would suggest you tighten up the regex to further allow only what you expect. But then you need to decide how to detect and handle that.

Consider this query using a WITH statement (known as a CTE or Common Table Expression) to set up some test data. Always use test data to set up unexpected data formats when testing. Instead of just looking for numbers, the regex for a_units matches a string starting with an 'A' followed by a group of 1 to 7 digits followed by a group consisting of a comma and returns the first group (the digits). If you don't care about the number of digits after the 'A', then replace '{1,7}' with a '+' which means 1 or more. NOTE that regexp_substr returns NULL if the match is not found.

with tbl(id, char_field) as (
  select 1, 'A5,P0' from dual union all
  select 2, 'A15,P1' from dual union all
  select 3, 'A6,P12' from dual union all
  select 4, 'A1234567,P1234567' from dual union all
  select 5, 'A1234567890,P0987654321' from dual union all
  select 6, NULL from dual union all
  select 7, 'A,P' from dual union all
  select 8, 'A,P0' from dual union all
  select 9, 'P12, A877' from dual
  )
SELECT id, char_field,
    REGEXP_SUBSTR(char_field, 'A(\d{1,7})(,)', 1, 1, NULL, 1) AS a_units, 
    REGEXP_SUBSTR(char_field, 'P(\d{1,7})($)', 1, 1, NULL, 1) AS p_units
FROM tbl
order by id;

ID CHAR_FIELD              A_UNITS                 P_UNITS                
-- ----------------------- ----------------------- -----------------------
 1 A5,P0                   5                       0                      
 2 A15,P1                  15                      1                      
 3 A6,P12                  6                       12                     
 4 A1234567,P1234567       1234567                 1234567                
 5 A1234567890,P0987654321                                                
 6                                                                        
 7 A,P                                                                    
 8 A,P0                                            0                      
 9 P12, A877                                                              

9 rows selected.

If you need to detect if a match is not found, you could use regexp_replace, which returns the original string if the match is not found, then compare it to char_field and if they are equal and not null, the match was not found. With regex_replace, the '\1' means replace the match with the first grouping.

with tbl(id, char_field) as (
  select 1, 'A5,P0' from dual union all
  select 2, 'A15,P1' from dual union all
  select 3, 'A6,P12' from dual union all
  select 4, 'A1234567,P1234567' from dual union all
  select 5, 'A1234567890,P0987654321' from dual union all
  select 6, NULL from dual union all
  select 7, 'A,P' from dual union all
  select 8, 'A,P0' from dual union all
  select 9, 'P12, A877' from dual
  )
SELECT id, char_field,
    REGEXP_REPLACE(char_field, 'A(\d{1,7}),.*', '\1') AS a_units, 
    REGEXP_REPLACE(char_field, '.*P(\d{1,7})$', '\1') AS p_units,
    case 
      when REGEXP_REPLACE(char_field, 'A(\d{1,7}),.*', '\1') is not null AND char_field = REGEXP_REPLACE(char_field, 'A(\d{1,7}),.*', '\1') then
        'A NO MATCH'
      when REGEXP_REPLACE(char_field, 'A(\d{1,7}),.*', '\1') is not null AND char_field != REGEXP_REPLACE(char_field, 'A(\d{1,7}),.*', '\1') then
        'A Matched'
      else
        'UNKNOWN'
     end as match
FROM tbl
order by id;


ID CHAR_FIELD              A_UNITS                 P_UNITS                 MATCH     
-- ----------------------- ----------------------- ----------------------- ----------
 1 A5,P0                   5                       0                       A Matched 
 2 A15,P1                  15                      1                       A Matched 
 3 A6,P12                  6                       12                      A Matched 
 4 A1234567,P1234567       1234567                 1234567                 A Matched 
 5 A1234567890,P0987654321 A1234567890,P0987654321 A1234567890,P0987654321 A NO MATCH
 6                                                                         UNKNOWN   
 7 A,P                     A,P                     A,P                     A NO MATCH
 8 A,P0                    A,P0                    0                       A NO MATCH
 9 P12, A877               P12, A877               P12, A877               A NO MATCH

9 rows selected.

So anyway, my purpose with all this is to highlight the fact that you need to include even unexpected test conditions, decide how you need to detect and handle exceptions (or even if you have to at all), and know that regexp_substr and regexp_replace return different values if the match is not found so perhaps one is more appropriate to use than the other depending on your circumstances. Stuff to keep in the back of your mind when developing your queries.

REGEXP_SUBSTR syntax not making sense... Examples within:

2 Answers2