Consecutive Pattern replacing is not happening with REGEXP_REPLACE

Question

I have a string as below

Welcome to the world of the Hackers

I am trying to replace the occurrences of listed strings i.e. of,to,the in between the entire string using below query, but it's not working properly if the patterns are consecutive, it fails.

SELECT regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' ) 
FROM dual;

Output: Welcome the world the Hackers

Even if the pattern is repeating consecutive it is not working i.e.

SELECT regexp_replace( 'Welcome to to the world of the Hackers', '( to )|( the )|( of )', ' ' ) 
FROM dual;

Output: Welcome to world the Hackers

Whereas my expected output is: Welcome world Hackers

Is there any alternative/solution for this using REGEXP_REPLACE?

MT0 · Accepted Answer · 2018-06-05T14:28:46.007

You can use the regular expression (^|\s+)((to|the|of)(\s+|$))+:

SQL Fiddle

Query 1:

WITH test_data ( sentence ) AS (
  SELECT 'to the of' FROM DUAL UNION ALL
  SELECT 'woof breathe toto' FROM DUAL UNION ALL -- has all the words as sub-strings of words
  SELECT 'theory of the offer to total' FROM DUAL -- mix of words to replace and words starting with those words
)
SELECT sentence,
       regexp_replace(
         sentence,
         '(^|\s+)((to|the|of)(\s+|$))+',
         '\1'
       ) AS replaced
FROM   test_data

Results:

|                     SENTENCE |           REPLACED |
|------------------------------|--------------------|
|                    to the of |             (null) | -- All words replaced
|            woof breathe toto |  woof breathe toto |
| theory of the offer to total | theory offer total |

Why doesn't regexp_replace( 'Welcome to the world of the Hackers', '( to )|( the )|( of )', ' ' ) work with successive matches?

Because the regular expression parser will look for the second match after the end of the first match and will not include the already parsed part of the string or the replacement text when looking for subsequent matches.

So the first match will be:

 'Welcome to the world of the Hackers'
         ^^^^

The second match will look in the sub-string following that match

 'the world of the Hackers'
           ^^^^

The 'the ' at the start of the sub-string will not be matched as it has no leading space character (yes, there was a space before it but that was matched in the previous match and, yes, that match was replaced with a space but overlapping matches and matches on previous replacements are not how regular expressions work).

So the second match is the ' of ' in the middle of the remaining sub-string.

There will be no third match as the remaining un-parsed sub-string is:

'the Hackers'

and, again, the 'the ' is not matched as there is not leading space character to match.

Kaushik Nayak · Answer 2 · 2018-06-05T14:26:48.527

REGEXP_REPLACE does not match a second pattern which is a part of the already matched pattern. This is more apparent when you use the multi-pattern matching like |. Thus, you can't rely on spaces for word boundaries to match multiple patterns this way. One solution could be to split and combine the characters. This may not be the best way, but works nonetheless. I would be glad to know a better solution.

This also assumes that you are ok with single spaces in the combined string when it had more than one in the original string.Also, words ending with comma or semicolon aren't considered. You may enhance it using NOT REGEXP_LIKE instead of NOT IN for such cases.

WITH t (id,s)
AS (
    SELECT 1 , 'Welcome to the world of the Hackers, you told me these words at the'
      FROM DUAL
      UNION ALL
    SELECT 2, 'The second line.Welcome to the world of the Hackers, you told me these words at the'
    FROM DUAL
    )
SELECT LISTAGG(word, ' ') WITHIN
GROUP (
        ORDER BY w
        )
FROM (
    SELECT id,
          LEVEL AS w
        ,REGEXP_SUBSTR(s, '[^ ]+', 1, LEVEL) AS word
    FROM t CONNECT BY LEVEL <= REGEXP_COUNT(s, '[^ ]+')
   AND PRIOR id = id 
  AND PRIOR SYS_GUID() IS NOT NULL

    )
WHERE lower(word) NOT IN (
        'to'
        ,'the'
        ,'of'
        )
        GROUP BY id;

Demo

Consecutive Pattern replacing is not happening with REGEXP_REPLACE

2 Answers2

Linked