1

I've created a regex which would capture the string I need. When I am testing regex on websites such as rubular.com then everything works, however when I put the same regex into REGEXP_SUBSTR function then it doesn't work.

Here are 2 SQL examples (one with text in English and another one in Kristaps Porzingis' language):

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">This is the text I would like to substr! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

I am trying to select text between <li class="b"> and next html tag, which in this case is <p class="tutorial">.

Any advice on what am I doing wrong?

Alexander Gusev
  • 295
  • 1
  • 9

3 Answers3

1

You can simplify that regex.
Instead of looking for specific characters, look for characters that are not < or >

For example:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">([^<>]+)',1,1,'i',1) as b_class
FROM dual

Where [^<>] matches any character that's not < or >

Or you could lazy match the characters till the first <

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">(.*?)<',1,1,'ni',1) as b_class
FROM dual

The .*? will try to consume characters till the first <
And by adding the match parameter n it'll also match if there's a multiline text after the tag.

'n' allows the period (.), which is the match-any-character character, to match the newline character. If you omit this parameter, the period does not match the newline character.

LukStorms
  • 28,916
  • 5
  • 31
  • 45
1

Parsing HTML with regex is not advisable, you'd better fetch the strings and parse them with a language that has convenient means to parse HTML.

If you only have Oracle DBMS at hand, for a one-off job, you may consider using the following regexp_substr:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>',
      '<li\s+class="b">([^<]+)', 1, 1, NULL, 1) as RESULT from dual

See the REXTESTER demo:

enter image description here

Here,

  • <li\s+class="b"> - matches <li, 1+ whitespaces, class="b">literal substring
  • ([^<]+) - captures into Group #1 one or more chars other than <

The last 1 argument lets you access the contents of this Group 1.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Thank you. Unfortunately, for this job I have only Oracle DBMS. I completely agree with you that parsing HTML with regex must be done in other, more appropriate languages. – Alexander Gusev Jun 15 '17 at 08:45
  • @Alex FYI, depending on the version, the Oracle DBMS has build in methods to work with XML. For example [EXTRACT](http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions051.htm) which allows you to use an xpath. But yeah, HTML isn't exactly XML. – LukStorms Jun 15 '17 at 09:23
0

I would use Instr to search for the position of the first html tag and do a substring after that position to get the tail of the text. The next step is to search for '<' on this tail and to use substring again.

Something like

select substring(mytext, 1, instr(mytext, '<')) from 
(
 select substring(text, instr(text, '<li class="b">') + 
 length('<li class="b">') +1) as mytext from table
)
RayCW
  • 174
  • 2
  • 10