Substring using Oracle SQL regex

Question

I've created a regex which would capture the string I need. When I am testing regex on websites such as rubular.com then everything works, however when I put the same regex into REGEXP_SUBSTR function then it doesn't work.

Here are 2 SQL examples (one with text in English and another one in Kristaps Porzingis' language):

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">This is the text I would like to substr! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

SELECT regexp_substr('<ul data-error-code="REOPENED" data-unique-error-code="REOPENED"><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p class="tutorial" href="#">Other random text that I do not need</li></ul>'
                    ,'<li class="b">([\wāēīšžģņļčķū:!,\b\s]+)<') 
  FROM dual;

I am trying to select text between <li class="b"> and next html tag, which in this case is <p class="tutorial">.

Any advice on what am I doing wrong?

LukStorms · Answer 1 · 2017-06-15T13:35:25.813

You can simplify that regex.
Instead of looking for specific characters, look for characters that are not < or >

For example:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">([^<>]+)',1,1,'i',1) as b_class
FROM dual

Where [^<>] matches any character that's not < or >

Or you could lazy match the characters till the first <

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>'
                    ,'<li class="b">(.*?)<',1,1,'ni',1) as b_class
FROM dual

The .*? will try to consume characters till the first <
And by adding the match parameter n it'll also match if there's a multiline text after the tag.

'n' allows the period (.), which is the match-any-character character, to match the newline character. If you omit this parameter, the period does not match the newline character.

score 1 · Accepted Answer · answered Jun 15 '17 at 08:26

1

Parsing HTML with regex is not advisable, you'd better fetch the strings and parse them with a language that has convenient means to parse HTML.

If you only have Oracle DBMS at hand, for a one-off job, you may consider using the following regexp_substr:

SELECT regexp_substr('<ul><li class="b">Šī ir valoda, ko lielākā daļa no jums nesaprot! <p>Not needed</li></ul>',
      '<li\s+class="b">([^<]+)', 1, 1, NULL, 1) as RESULT from dual

See the REXTESTER demo:

Here,

<li\s+class="b"> - matches <li, 1+ whitespaces, class="b">literal substring
([^<]+) - captures into Group #1 one or more chars other than <

The last 1 argument lets you access the contents of this Group 1.

answered Jun 15 '17 at 08:26

Wiktor Stribiżew

607,720
39
448
563

1

Thank you. Unfortunately, for this job I have only Oracle DBMS. I completely agree with you that parsing HTML with regex must be done in other, more appropriate languages. – Alexander Gusev Jun 15 '17 at 08:45
@Alex FYI, depending on the version, the Oracle DBMS has build in methods to work with XML. For example [EXTRACT](http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions051.htm) which allows you to use an xpath. But yeah, HTML isn't exactly XML. – LukStorms Jun 15 '17 at 09:23

RayCW · Answer 3 · 2017-06-22T15:05:00.367

0

I would use Instr to search for the position of the first html tag and do a substring after that position to get the tail of the text. The next step is to search for '<' on this tail and to use substring again.

Something like

select substring(mytext, 1, instr(mytext, '<')) from 
(
 select substring(text, instr(text, '<li class="b">') + 
 length('<li class="b">') +1) as mytext from table
)

edited Jun 22 '17 at 15:05

answered Jun 22 '17 at 14:59

RayCW

174
2
10

Substring using Oracle SQL regex

3 Answers3