0

I need help in removing outer tag from text .Since it is not text but Html tag it is not working as per given solution: original text :

<HIDE><HIDE>Anti-</HIDE></HIDE>Skull

i need to remove outer and tag from text .

which may turn into :

<HIDE>Anti-</HIDE>Skull

2 Answers2

1

Just to wrap up what we discussed in the comments:

SELECT regexp_replace('<HIDE><HIDE>Anti-</HIDE></HIDE>Skull', '(<.*>)\1+', '\1') AS checktext FROM dual

regexp_replace replaces part of the input that matches a regular expression with another string. If you use (<.*>)\1+ as the regex, it will match strings that start and end with <>, followed by the same thing again multiple times. Using \1 as the replacement string will then replace the whole thing with just one copy of that string. If you have trouble to understand how the regex works, please read about "capturing groups".

Please read MT0's answer for an approach that is more robust in case your html becomes more complicated.

Malte Hartwig
  • 4,477
  • 2
  • 14
  • 30
0

You can do this using XML functions:

WITH data ( xmlstring ) AS (
  SELECT '<html><body><HIDE><HIDE>Anti-</HIDE></HIDE>Skull</body></html>' FROM DUAL
)
SELECT UPDATEXML(
         xml,
         '//HIDE/HIDE/..',
         XMLQUERY( '//HIDE/HIDE' PASSING xml RETURNING CONTENT )
       ).getClobVal() AS updatedXML
FROM   ( SELECT XMLTYPE( xmlstring ) AS xml FROM data );

Outputs:

<html><body><HIDE>Anti-</HIDE>Skull</body></html>

You should also be able to use the more generic XPath '//*[name()=../name()]/..'; however, while XMLQUERY parses it successfully, UPDATEXML does not.

Update:

WITH data ( xmlstring ) AS (
  SELECT '<HIDE><HIDE>Anti-</HIDE></HIDE>Skull' FROM DUAL
)
SELECT SUBSTR(
         updatedXML,
         7,
         LENGTH( updatedXML ) - 13
       )
FROM   (
  SELECT UPDATEXML(
           xml,
           '//HIDE/HIDE/..',
           XMLQUERY( '//HIDE/HIDE' PASSING xml RETURNING CONTENT )
         ).getClobVal() AS updatedXML
  FROM   (
    SELECT XMLTYPE( '<root>' || xmlstring || '</root>' ) AS xml
    FROM data
  )
);

Update 2:

Before using a regular expression, you should understand the limitations - a regular expression cannot parse arbitrary HTML but will only work on a limited subset. In this case it will not match attributes or match the correct level of tags (and you should use an XML parser for that):

WITH data ( html ) AS (
  SELECT '<HIDE><HIDE>Anti-</HIDE></HIDE>Skull' FROM DUAL UNION ALL
  SELECT '<HIDE><HIDE>first</HIDE></HIDE>between<HIDE><HIDE>second</HIDE></HIDE>' FROM DUAL UNION ALL
  SELECT '<HIDE><HIDE>before<HIDE><HIDE>inner</HIDE></HIDE>after</HIDE></HIDE>outer' FROM DUAL
)
SELECT REGEXP_REPLACE(
         html,
         '<([A-Za-z][A-Za-z0-9-]*)><\1>(.*)</\1></\1>',
         '<\1>\2</\1>'
       )
FROM   data;

Output:

<HIDE>Anti-</HIDE>Skull
<HIDE>first</HIDE></HIDE>between<HIDE><HIDE>second</HIDE>
<HIDE>before<HIDE><HIDE>inner</HIDE></HIDE>after</HIDE>outer

The regular expression will match the simple case but you will find it difficult (or impossible) to correctly handle both the second and third cases.

MT0
  • 143,790
  • 11
  • 59
  • 117
  • sorry for the confusion but cant it be done using simple regex. – Animesh Sheolikar May 31 '17 at 11:32
  • @AnimeshSheolikar I'm sure it could be done in the simple example but you should consider that the tags may have attributes and could have further descendants inside those that would complicate the regular expression. Such as `'beforeinnerafterother'` a regular expression will not match the start and end tags at the same level. You should probably read the second and third posts from [this question](https://stackoverflow.com/q/1732348/1509264) before you look for a regular expression that will solve you problem. – MT0 May 31 '17 at 11:38
  • I understand your concern regarding attribute but its only simple in given text no further descendants will be there. – Animesh Sheolikar May 31 '17 at 11:44
  • @AnimeshSheolikar Updated again – MT0 May 31 '17 at 12:02