How to remove duplicate html tag from text in regex oracle

Question

I need help in removing outer tag from text .Since it is not text but Html tag it is not working as per given solution: original text :

<HIDE><HIDE>Anti-</HIDE></HIDE>Skull

i need to remove outer and tag from text .

which may turn into :

<HIDE>Anti-</HIDE>Skull

Show what you have tried so far and where you are stuck. So far, it is just a duplicate of: https://stackoverflow.com/questions/26672269/how-to-remove-duplicates-from-comma-separated-list-by-regex-in-oracle-regexp-rep — Malte Hartwig, May 31 '17 at 10:05
Select regexp_replace('Anti-Skull', '/(.*)\1', '/\1') As checktext From dual — Animesh Sheolikar, May 31 '17 at 10:21
select REGEXP_REPLACE('Anti-Skull','([^,]+)(,[ ]*\1)+') as regigex from dual; — Animesh Sheolikar, May 31 '17 at 10:21
but i want to remove ony extra and from string.here getting problem. — Animesh Sheolikar, May 31 '17 at 10:22
When I tried your first regex, I removed the slashes and added `<>` and it worked: `'(<.*>)\1', '\1'`. Result was `Anti-Skull`. Does that solve your problem? — Malte Hartwig, May 31 '17 at 11:55

Malte Hartwig · Answer 1 · 2017-06-01T12:55:19.860

Just to wrap up what we discussed in the comments:

SELECT regexp_replace('<HIDE><HIDE>Anti-</HIDE></HIDE>Skull', '(<.*>)\1+', '\1') AS checktext FROM dual

regexp_replace replaces part of the input that matches a regular expression with another string. If you use (<.*>)\1+ as the regex, it will match strings that start and end with <>, followed by the same thing again multiple times. Using \1 as the replacement string will then replace the whole thing with just one copy of that string. If you have trouble to understand how the regex works, please read about "capturing groups".

Please read MT0's answer for an approach that is more robust in case your html becomes more complicated.

MT0 · Answer 2 · 2017-05-31T12:01:23.203

You can do this using XML functions:

WITH data ( xmlstring ) AS (
  SELECT '<html><body><HIDE><HIDE>Anti-</HIDE></HIDE>Skull</body></html>' FROM DUAL
)
SELECT UPDATEXML(
         xml,
         '//HIDE/HIDE/..',
         XMLQUERY( '//HIDE/HIDE' PASSING xml RETURNING CONTENT )
       ).getClobVal() AS updatedXML
FROM   ( SELECT XMLTYPE( xmlstring ) AS xml FROM data );

Outputs:

<html><body><HIDE>Anti-</HIDE>Skull</body></html>

You should also be able to use the more generic XPath '//*[name()=../name()]/..'; however, while XMLQUERY parses it successfully, UPDATEXML does not.

Update:

WITH data ( xmlstring ) AS (
  SELECT '<HIDE><HIDE>Anti-</HIDE></HIDE>Skull' FROM DUAL
)
SELECT SUBSTR(
         updatedXML,
         7,
         LENGTH( updatedXML ) - 13
       )
FROM   (
  SELECT UPDATEXML(
           xml,
           '//HIDE/HIDE/..',
           XMLQUERY( '//HIDE/HIDE' PASSING xml RETURNING CONTENT )
         ).getClobVal() AS updatedXML
  FROM   (
    SELECT XMLTYPE( '<root>' || xmlstring || '</root>' ) AS xml
    FROM data
  )
);

Update 2:

Before using a regular expression, you should understand the limitations - a regular expression cannot parse arbitrary HTML but will only work on a limited subset. In this case it will not match attributes or match the correct level of tags (and you should use an XML parser for that):

WITH data ( html ) AS (
  SELECT '<HIDE><HIDE>Anti-</HIDE></HIDE>Skull' FROM DUAL UNION ALL
  SELECT '<HIDE><HIDE>first</HIDE></HIDE>between<HIDE><HIDE>second</HIDE></HIDE>' FROM DUAL UNION ALL
  SELECT '<HIDE><HIDE>before<HIDE><HIDE>inner</HIDE></HIDE>after</HIDE></HIDE>outer' FROM DUAL
)
SELECT REGEXP_REPLACE(
         html,
         '<([A-Za-z][A-Za-z0-9-]*)><\1>(.*)</\1></\1>',
         '<\1>\2</\1>'
       )
FROM   data;

Output:

<HIDE>Anti-</HIDE>Skull
<HIDE>first</HIDE></HIDE>between<HIDE><HIDE>second</HIDE>
<HIDE>before<HIDE><HIDE>inner</HIDE></HIDE>after</HIDE>outer

The regular expression will match the simple case but you will find it difficult (or impossible) to correctly handle both the second and third cases.

sorry for the confusion but cant it be done using simple regex. — Animesh Sheolikar, May 31 '17 at 11:32
@AnimeshSheolikar I'm sure it could be done in the simple example but you should consider that the tags may have attributes and could have further descendants inside those that would complicate the regular expression. Such as `'beforeinnerafterother'` a regular expression will not match the start and end tags at the same level. You should probably read the second and third posts from [this question](https://stackoverflow.com/q/1732348/1509264) before you look for a regular expression that will solve you problem. — MT0, May 31 '17 at 11:38
I understand your concern regarding attribute but its only simple in given text no further descendants will be there. — Animesh Sheolikar, May 31 '17 at 11:44

How to remove duplicate html tag from text in regex oracle

2 Answers2