I have this regular expression

/\[url=(?:")?(.*?)(?:")?\](.*?)\[\/url\]/mi

and these blocks of text

[url=/someurl?page=5#3467]First[/url][postquote=true]
[url=/another_url/who-is?page=4#3396] Second[/url]
Some text[url=/another_url/who-is?page=3][i]3[/i] Third [/url]

and the regex works great at extracting the URLs and the text between the url tags

Match 1

1.  /someurl?page=5#3467
2.  First

Match 2

1.  /another_url/who-is?page=4#3396
2.  Second

Match 3

1.  /another_url/who-is?page=3
2.  [i]3[/i] Third

The problem happens when I use the same regex from above to try to extract the url from this text

This is some text [url=https://www.somesite.com/location/?opt[]=apples]Link Name[/url]

Match 1

1.  https://www.somesite.com/location/?opt[
2.  =apples]Link Name

Notice the =apples] at the start of the second capture. What I need is for the first capture to include it as part of the URL, like this:

  1. https://www.somesite.com/location/?opt[]=apples
  2. Link Name
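
A minimal Ruby sketch that reproduces the behavior described above, assuming the regex and sample text exactly as shown (the variable names are only for illustration). The lazy `(.*?)` stops at the first `]` that lets the rest of the pattern match, which here is the one inside `opt[]`:

```ruby
# Regex from the question; the lazy first group ends at the first "]".
original = /\[url=(?:")?(.*?)(?:")?\](.*?)\[\/url\]/mi

sample = 'This is some text [url=https://www.somesite.com/location/?opt[]=apples]Link Name[/url]'

m = original.match(sample)
puts m[1] # the URL, cut off inside the brackets
puts m[2] # the rest of the URL plus the link text
```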

I have tried many modifications to this regex with no luck so far. Any help would be appreciated.

Matt Elhotiby

1 Answer

Ruby's regex engine supports duplicate named capture groups. With this feature, you can handle both cases easily (the URL wrapped in &quote; and the bare URL). You don't need a recursive pattern, since I doubt that [] can be nested in the query part of a URL:

/\[url=(?:&quote;(?<url>[^&]*(?:&(?!quote;)[^&]*)*)&quote;|(?<url>[^\s\]\[]*(?:\[\][^\s\]\[]*)*))\](?<text>.*?)\[\/url\]/mi

The URL is in the named group url, and the content between the tags is in the named group text.

In a more readable format:

/

\[url=
(?:
    &quote; (?<url> [^&]* (?:&(?!quote;)[^&]*)* ) &quote;
  |
    (?<url> [^\s\]\[]* (?:\[\][^\s\]\[]*)* )
)
\]
(?<text>.*?)\[\/url\]

/mix
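
As a rough usage sketch, assuming the pattern above is kept as written (including the literal `&quote;` delimiter) and that captures are read through `MatchData`: the constant `URL_TAG` and the helper `extract_links` are names made up for this example. Reading by name matters here because a plain `scan` without a block would return one slot per group, with `nil` for the `url` branch that did not match.

```ruby
# The answer's pattern in extended (x) mode. URL_TAG is a hypothetical name.
URL_TAG = /
  \[url=
  (?:
      &quote; (?<url> [^&]* (?:&(?!quote;)[^&]*)* ) &quote;  # quoted form
    |
      (?<url> [^\s\]\[]* (?:\[\][^\s\]\[]*)* )               # bare form, may contain empty bracket pairs
  )
  \]
  (?<text>.*?)\[\/url\]
/mix

# Hypothetical helper: returns [url, text] pairs for every tag in the input.
def extract_links(input)
  pairs = []
  input.scan(URL_TAG) do
    m = Regexp.last_match
    # With duplicate names, m[:url] gives whichever url group took part in the match.
    pairs << [m[:url], m[:text]]
  end
  pairs
end

sample = 'This is some text [url=https://www.somesite.com/location/?opt[]=apples]Link Name[/url]'
extract_links(sample).each { |url, text| puts "#{url} -> #{text}" }
```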
Casimir et Hippolyte