The phrase, "substring which follows logic to substring from the right side, after the third _
," is difficult to understand. I infer that the intended behavior is to extract the subtring that includes all text before the third _
from the right and not all text before the third _
from the left. When searching from the left, there is also ambiguity as to expected behavior if there are less than three _
characters: should the entire string be returned or should the result be NULL
? Unfortunately, the example input, 22D_XYZ_xy_100_xyz_123
, and expected output, 22D_XYZ_xy
, are consistent with all of these cases. The following query demonstrates all three interpretations:
SELECT
string,
regexp_substr(string, '^.*(?=(_[^_]*){3}$)') as from_right,
regexp_substr(string, '^([^_]*_){0,2}[^_]*') as from_left,
regexp_substr(string, '^([^_]*_){2}[^_]*(?=_)') as from_left_requires_3
FROM (
VALUES ('22D_XYZ_xy_100_xyz_123'),
('1_2_3_4_5_6_7'),
('1_2_3_4_5_6'),
('1_2_3_4_5_'),
('_1_2_3_4_5'),
('1_2_3_4_5'),
('1_2_3_4'),
('1_2_3'),
('1_2_'),
('1_2'),
('1'),
('____'),
('___'),
('__'),
('_'),
(''),
(NULL)) t(string);
Executing the query gives the following:
string |
from_right |
from_left |
from_left_requires_3 |
'22D_XYZ_xy_100_xyz_123' |
'22D_XYZ_xy' |
'22D_XYZ_xy' |
'22D_XYZ_xy' |
'1_2_3_4_5_6_7' |
'1_2_3_4' |
'1_2_3' |
'1_2_3' |
'1_2_3_4_5_6' |
'1_2_3' |
'1_2_3' |
'1_2_3' |
'1_2_3_4_5_' |
'1_2_3' |
'1_2_3' |
'1_2_3' |
'_1_2_3_4_5' |
'_1_2' |
'_1_2' |
'_1_2' |
'1_2_3_4_5' |
'1_2' |
'1_2_3' |
'1_2_3' |
'1_2_3_4' |
'1' |
'1_2_3' |
'1_2_3' |
'1_2_3' |
NULL |
'1_2_3' |
NULL |
'1_2_' |
NULL |
'1_2_' |
NULL |
'1_2' |
NULL |
'1_2' |
NULL |
'1' |
NULL |
'1' |
NULL |
'____' |
'_' |
'__' |
'__' |
'___' |
'' |
'__' |
'__' |
'__' |
NULL |
'__' |
NULL |
'_' |
NULL |
'_' |
NULL |
'' |
NULL |
'' |
NULL |
NULL |
NULL |
NULL |
NULL |
To search from the right, a positive lookahead zero-length assertion, (?=(_[^_]*){3}$)
, is used to check that the unmatched portion of the string begins with _
and includes exactly three _
characters. If the assertion is false, then NULL
is returned.
To find the left substring with no more than two _
characters, the regular expression matches up to two substings of zero or more non-_
characters followed by a single _
along with any non-_
characters prior to the next _
or the end of the string.
Finding the left substring preceding a required third _
is similar to finding the left substring with no more than two _
characters, except that the occurrence specifier for the group is {2}
instead of {0,2}
and a positive lookahead zero-lengh assertion, (?=_)
, ensures that _
is the first unmatched character. As with the search from the right, if the assertion is false, then NULL is returned.