
I would like to match the text between two strings, although the closing string/character might not always be present.

String1: 'www.mywebsite.com/search/keyword=toys'

String2: 'www.mywebsite.com/search/keyword=toys&lnk=hp1'

Here I want to match the value of keyword=, which is 'toys', and I am using

(?<=keyword=)(.*)(?=&|$)

This works for String1, but for String2 the match also includes '&' and everything after it.

What am I doing wrong?

Wiktor Stribiżew
user7088181
    Does this answer your question? [In regex, match either the end of the string or a specific character](https://stackoverflow.com/questions/12083308/in-regex-match-either-the-end-of-the-string-or-a-specific-character) – Wiktor Stribiżew Feb 08 '22 at 08:11

2 Answers


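.* is greedy: it consumes as many characters as it can, so it runs all the way to the end of the string (where the lookahead is satisfied by $) instead of stopping at the first & character.
Change it to its non-greedy version, .*?, which stops at the first position where the lookahead succeeds.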

with t as
(
    select  explode
            (
                array
                (
                    'www.mywebsite.com/search/keyword=toys'
                   ,'www.mywebsite.com/search/keyword=toys&lnk=hp1'
                )
            ) as (val)
)
select  regexp_extract(val,'(?<=keyword=)(.*?)(?=&|$)',0)
from    t
;

+------+
| toys |
+------+
| toys |
+------+
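
The difference between the greedy and non-greedy patterns can also be checked outside Hive. The following is only a sketch in Python's `re` module, which is a different engine from the one behind `regexp_extract`; for these particular patterns it is expected to behave the same way. The helper name `whole_match` is made up for this illustration.

```python
import re

# Greedy pattern from the question and the non-greedy fix from the answer.
GREEDY = re.compile(r"(?<=keyword=)(.*)(?=&|$)")
LAZY = re.compile(r"(?<=keyword=)(.*?)(?=&|$)")

urls = [
    "www.mywebsite.com/search/keyword=toys",
    "www.mywebsite.com/search/keyword=toys&lnk=hp1",
]


def whole_match(pattern, text):
    """Hypothetical helper: return the whole match (group 0), or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


for url in urls:
    # Left column: greedy result, right column: non-greedy result.
    print(whole_match(GREEDY, url), "|", whole_match(LAZY, url))
# prints:
# toys | toys
# toys&lnk=hp1 | toys
```

The greedy version only goes wrong on the second URL, because that is the only input where something follows the value.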
David דודו Markovitz

You do not need to bother with greediness when you need to match zero or more occurrences of any characters but a specific character (or set of characters). All you need is to get rid of the lookahead and the dot pattern and use [^&]* (or, if the value you expect should not be an empty string, [^&]+):

(?<=keyword=)[^&]+

Code:

select regexp_extract(val,'(?<=keyword=)[^&]+', 0) from t

See the regex demo

Note you do not even need a capturing group since the 0 argument instructs regexp_extract to retrieve the value of the whole match.

Pattern details

  • (?<=keyword=) - a positive lookbehind that matches a location that is immediately preceded by keyword=
  • [^&]+ - any 1+ chars other than & (if you use * instead of +, it will match 0 or more occurrences).
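
The practical difference between `+` and `*` shows up when the value is empty. Below is a rough sketch using Python's `re` module rather than Hive itself; the third URL (with an empty `keyword=`) and the helper name `extract` are assumptions added for illustration and do not come from the question.

```python
import re

# Negated character class: stop at the first '&' without any lookahead.
ONE_OR_MORE = re.compile(r"(?<=keyword=)[^&]+")
ZERO_OR_MORE = re.compile(r"(?<=keyword=)[^&]*")

samples = [
    "www.mywebsite.com/search/keyword=toys",
    "www.mywebsite.com/search/keyword=toys&lnk=hp1",
    "www.mywebsite.com/search/keyword=&lnk=hp1",  # assumed edge case: empty value
]


def extract(pattern, text):
    """Hypothetical helper: return the whole match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


for s in samples:
    print(repr(extract(ONE_OR_MORE, s)), repr(extract(ZERO_OR_MORE, s)))
# prints:
# 'toys' 'toys'
# 'toys' 'toys'
# None ''
```

With `+` the empty value produces no match at all, while `*` produces a match that happens to be the empty string; which of the two is preferable depends on how the caller treats missing values.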
Wiktor Stribiżew