How is empty string and space treated in regular expression?

Question

I was trying to concatenate distinct values of rows using LISTAGG and REGEXP_REPLACE functions as described in this link: LISTAGG in Oracle to return distinct values.

I am trying to understand what is going on, I made some changes and started playing with it. However, I wasn't getting the answer I was expecting. It seems like I'm missing some key concept on how "empty string" and "space" are treated by the regexp engine. How does the regexp engine process the why_space_affected_here, why_space_affected_here, why_space_not_affected_here, matching_empty_string, and why_space_not_affected columns in the below sample query?

select group_key
     , listagg(id, ', ') within group (order by id) as listagg_output`
     , regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(, \1)+', '\1') as why_space_affected_here
     , regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(,\1)+', '\1') as why_space_not_affected_here
     , regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(, \1)+', 'T') as matching_empty_string
     , regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(, \1)+($|,)', '\1\3') as why_space_not_affected
from (
    select 22 group_key, 1 id from dual
    union all
    select 22 group_key, 2 id from dual
    union all
    select 22 group_key,  3 id from dual
    union all
    select 22 group_key, 3 id from dual
    )
group by group_key;

I was expecting the spaces to be reserved for the why_space_affected_here column (the query given below): regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(, \1)+', '\1') as why_space_affected_here.

Add sample data. Add expected output. Try to shorten your query. — user14063792468, Jul 21 '19 at 22:15

MT0 · Answer 1 · 2019-07-23T08:41:10.527

SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(, \1)+', '\1') as why_space_affected_here
FROM DUAL

Outputs 123 because the first and second replacements ([^,]*) matches a zero-width string and only the comma and space are replaced with a zero-width string and then in the third replacement ([^,]*) matches 3 and 3, 3 is replace by just 3.

SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(,\1)+', '\1') as why_space_not_affected_here
FROM DUAL

Outputs 1 2 3 as the first and second replacements ([^,]*) matches a zero-width string and only the commas are replaced and the third replacement ([^,]*) matches space then 3 and the , 3 is removed.

SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(, \1)+', 'T') as matching_empty_string
FROM DUAL

Outputs 1T2TT - this is exactly the same as your first example but instead of keeping the first capture group you're replacing everything with T.

SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(, \1)+($|,)', '\1\3') as why_space_not_affected
FROM DUAL

Outputs 1, 2, 3 as its first match is to 3, 3 when ([^,]*) matches 3 and ($|,) matches the end of the string.

Can you please describe or walk me through how the first and second replacements matches a zero-width string?

For the regular expression ([^,]*)(, \1)+ then [^,] matches a non-comma chatacter. [^,]* matches zero-or-more non-comma characters.

From the start of the string:
```
1, 2, 3, 3
^^^^
```
([^,]*) could match 1 but it is succeeded by , 2 and not by , 1 so this will not match the entire pattern.
Still at the start of the string:
```
1, 2, 3, 3
^
```
([^,]*) could match a zero-width (empty) sub-string but the next character is a 1 and not a , so this will not match the entire pattern and having tried all possibile matches the pattern matcher will move on to testing starting from the next character.
Moving on the starting from the next character:
```
1, 2, 3, 3
 ^^
```
([^,]*) cannot match the comma but it can match a zero-width (empty) string which is succeeded by comma then space and then a repeat of the first match, a zero-width sub-string. So the pattern will successfully match just a comma and a space.
Starting from 2 we effectively repeat the previous steps and end up skipping the 2 character and matching a zero-width sub-string then comma then space then the repeated zero-width sub-string.

Thank you for your answer, @MT0! I get how the third replacement `([^,]*)` matches 3 and 3, 3 is replace by just 3. What I'm not getting is how the first and second replacements `([^,]*)` matches a zero-width string and only the comma and space are replaced with a zero-width string in the below query: `SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(, \1)+', 'T') as matching_empty_string FROM DUAL`. Can you please describe or walk me through how the first and second replacements matches a zero-width string? Sorry, this is a little foreign to me. — user11816100, Jul 22 '19 at 19:19

How is empty string and space treated in regular expression?

1 Answers1