0

I was trying to concatenate distinct values of rows using LISTAGG and REGEXP_REPLACE functions as described in this link: LISTAGG in Oracle to return distinct values.

I am trying to understand what is going on, I made some changes and started playing with it. However, I wasn't getting the answer I was expecting. It seems like I'm missing some key concept on how "empty string" and "space" are treated by the regexp engine. How does the regexp engine process the why_space_affected_here, why_space_affected_here, why_space_not_affected_here, matching_empty_string, and why_space_not_affected columns in the below sample query?

select group_key
     , listagg(id, ', ') within group (order by id) as listagg_output`
     , regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(, \1)+', '\1') as why_space_affected_here
     , regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(,\1)+', '\1') as why_space_not_affected_here
     , regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(, \1)+', 'T') as matching_empty_string
     , regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(, \1)+($|,)', '\1\3') as why_space_not_affected
from (
    select 22 group_key, 1 id from dual
    union all
    select 22 group_key, 2 id from dual
    union all
    select 22 group_key,  3 id from dual
    union all
    select 22 group_key, 3 id from dual
    )
group by group_key;

I was expecting the spaces to be reserved for the why_space_affected_here column (the query given below): regexp_replace(listagg(id, ', ') within group (order by id), '([^,]*)(, \1)+', '\1') as why_space_affected_here.

halfer
  • 19,824
  • 17
  • 99
  • 186

1 Answers1

0
SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(, \1)+', '\1') as why_space_affected_here
FROM DUAL

Outputs 123 because the first and second replacements ([^,]*) matches a zero-width string and only the comma and space are replaced with a zero-width string and then in the third replacement ([^,]*) matches 3 and 3, 3 is replace by just 3.

SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(,\1)+', '\1') as why_space_not_affected_here
FROM DUAL

Outputs 1 2 3 as the first and second replacements ([^,]*) matches a zero-width string and only the commas are replaced and the third replacement ([^,]*) matches space then 3 and the , 3 is removed.

SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(, \1)+', 'T') as matching_empty_string
FROM DUAL

Outputs 1T2TT - this is exactly the same as your first example but instead of keeping the first capture group you're replacing everything with T.

SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(, \1)+($|,)', '\1\3') as why_space_not_affected
FROM DUAL

Outputs 1, 2, 3 as its first match is to 3, 3 when ([^,]*) matches 3 and ($|,) matches the end of the string.


Can you please describe or walk me through how the first and second replacements matches a zero-width string?

For the regular expression ([^,]*)(, \1)+ then [^,] matches a non-comma chatacter. [^,]* matches zero-or-more non-comma characters.

  • From the start of the string:
    1, 2, 3, 3
    ^^^^
    
    ([^,]*) could match 1 but it is succeeded by , 2 and not by , 1 so this will not match the entire pattern.
  • Still at the start of the string:

    1, 2, 3, 3
    ^
    

    ([^,]*) could match a zero-width (empty) sub-string but the next character is a 1 and not a , so this will not match the entire pattern and having tried all possibile matches the pattern matcher will move on to testing starting from the next character.

  • Moving on the starting from the next character:

    1, 2, 3, 3
     ^^
    

    ([^,]*) cannot match the comma but it can match a zero-width (empty) string which is succeeded by comma then space and then a repeat of the first match, a zero-width sub-string. So the pattern will successfully match just a comma and a space.

  • Starting from 2 we effectively repeat the previous steps and end up skipping the 2 character and matching a zero-width sub-string then comma then space then the repeated zero-width sub-string.
MT0
  • 143,790
  • 11
  • 59
  • 117
  • Thank you for your answer, @MT0! I get how the third replacement `([^,]*)` matches 3 and 3, 3 is replace by just 3. What I'm not getting is how the first and second replacements `([^,]*)` matches a zero-width string and only the comma and space are replaced with a zero-width string in the below query: `SELECT REGEXP_REPLACE( '1, 2, 3, 3', '([^,]*)(, \1)+', 'T') as matching_empty_string FROM DUAL`. Can you please describe or walk me through how the first and second replacements matches a zero-width string? Sorry, this is a little foreign to me. – user11816100 Jul 22 '19 at 19:19