I have URLs in this format:

/scan/anything/se=hello-world/se=word.html
/scan/anything/se=hello-world/se=1.5/
/scan/anything/se=temp-2.5/se=1.5.html

I'm trying to match and capture the word characters, dashes and decimal points that follow each se=.

The regex I have come up with is this:

  ^/scan/.*?se=([\w-.]*)/?(?:se=)([\w-.]*)/?(?:.html)?

Because I added a dot (.) to the character class to match the decimal point, it also matches .html, so it captures word.html and 1.5.html rather than just "word" and "1.5" from URLs 1 and 3. How can I stop it matching .html? I've tried various negations but none seem to work.

Desired output:

  • hello-world and word
  • hello-world and 1.5
  • temp-2.5 and 1.5
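
For anyone who wants to reproduce the behaviour described above, here is a minimal Python sketch. It is an assumption that Python is a fair stand-in for the engine being used; the character class is written as `[\w.-]` because Python's `re` module rejects the `[\w-.]` spelling, and the names `QUESTION_REGEX` and `urls` are only illustrative.

```python
import re

# The question's pattern, with the class spelled [\w.-] (hyphen last) so that
# Python's re module accepts it. Everything else is unchanged.
QUESTION_REGEX = re.compile(
    r"^/scan/.*?se=([\w.-]*)/?(?:se=)([\w.-]*)/?(?:.html)?"
)

urls = [
    "/scan/anything/se=hello-world/se=word.html",
    "/scan/anything/se=hello-world/se=1.5/",
    "/scan/anything/se=temp-2.5/se=1.5.html",
]

for url in urls:
    # The dot in the class lets the second group swallow ".html" as well.
    print(QUESTION_REGEX.match(url).groups())
# ('hello-world', 'word.html')
# ('hello-world', '1.5')
# ('temp-2.5', '1.5.html')
```

The second group is greedy and `.` is in its class, so nothing is left for the optional `(?:.html)?` at the end to consume.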
sniperd
Andrew Smith
  • What are the exact desired outputs? – revo Nov 10 '17 at 13:32
  • This is slightly hacky, but you could just do a string replacement for .html and replace it with nothing. – Froopy Nov 10 '17 at 13:41
  • Assuming that the html extension appears at the end of the URL only, you may go with something like [**`^\/scan\/[^=]*se=([\w.-]+)\/se=((?:[\w.-](?!\w*$))+)`**](https://regex101.com/r/kMVn6W/1) – revo Nov 10 '17 at 13:49
  • I think that does it, thanks Revo, just testing it. – Andrew Smith Nov 10 '17 at 14:16
  • I think it still needs /?(?:.html)? on the end though? – Andrew Smith Nov 10 '17 at 14:18
  • I don't know if that's a case you can encounter, but revo's regex wouldn't capture `1.5` in `/scan/anything/se=hello-world/se=1.5` (without a `/` at the end of the string). If that's a case you can encounter, please have a look at my suggestion which does capture `1.5` in this case. Plus it allows more `se` parameters than two (again, if that's a case you want to handle). – Paul-Etienne Nov 10 '17 at 14:26
  • Thanks, I'm testing these in apache at the moment – Andrew Smith Nov 10 '17 at 15:05

2 Answers

You want to use a negated character class combined with a positive lookahead, which is not included in the captured text:

se=([^/]+)/se=((?:[^/]+)(?=\.html)|[^/]+)

That way you capture every character that is not a / up to the next /, and the lookahead stops the second group just before a trailing .html.

Here is a little example in Python:

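import re

# Sample URLs from the question.
thelist = [
"/scan/anything/se=hello-world/se=word.html",
"/scan/anything/se=hello-world/se=1.5/",
"/scan/anything/se=temp-2.5/se=1.5.html",
]

# Raw string so that \. reaches the regex engine unchanged.
# Group 1: the first se= value. Group 2: the second se= value, stopping
# before ".html" when the lookahead finds one, otherwise running to the next "/".
regex = r"se=([^/]+)/se=((?:[^/]+)(?=\.html)|[^/]+)"

for item in thelist:
    thematch = re.search(regex, item)
    print(thematch.group(1))
    print(thematch.group(2))
    print("------------")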

results:

hello-world
word
------------
hello-world
1.5
------------
temp-2.5
1.5
------------

http://regex101.com is a nice little site for playing around with this kind of thing if you need to tweak a regex.

sniperd

I suggest this regex:

se=((?:[\w-.]+)(?=\.html)|[\w-.]+)

See this demo.

This will match any run of word characters, - or . after se=, and it stops right before .html if one follows.
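
As a quick check outside regex101, the same idea can be run in Python. This is only a sketch: the class is spelled `[\w.-]` because Python's `re` module does not accept `[\w-.]`, and the helper name `se_values` is made up for illustration.

```python
import re

# Same pattern as above, with the hyphen moved to the end of the class so
# that Python's re module accepts it.
SE_VALUE = re.compile(r"se=((?:[\w.-]+)(?=\.html)|[\w.-]+)")


def se_values(url):
    """Return the captured value of every se= parameter in url."""
    return SE_VALUE.findall(url)


for url in (
    "/scan/anything/se=hello-world/se=word.html",
    "/scan/anything/se=hello-world/se=1.5/",
    "/scan/anything/se=temp-2.5/se=1.5.html",
):
    print(se_values(url))
# ['hello-world', 'word']
# ['hello-world', '1.5']
# ['temp-2.5', '1.5']
```

Because the pattern is not anchored, `findall` picks up however many se= parameters the URL contains.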

Edit:

The above regex won't capture .html even when it appears inside the URL, for instance at the end of the first parameter. For example, this is what would be captured in this case:

/scan/anything/se=hello-world.html/se=word.html
                  ^^^^^^^^^^^         ^^^^

So if you want to capture everything but the very last .html, you have to add the end-of-string anchor $ to the lookahead:

se=((?:[\w-.]+)(?=\.html$)|[\w-.]+)

See this second demo.
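
The difference between the two variants can also be seen side by side in Python. Again this is just a sketch under the assumption that `[\w.-]` is an acceptable respelling of the class for Python's `re` module; `ANYWHERE` and `END_ONLY` are illustrative names.

```python
import re

# Lookahead without $: stops before any ".html" that follows the value.
ANYWHERE = re.compile(r"se=((?:[\w.-]+)(?=\.html)|[\w.-]+)")
# Lookahead with $: stops only before a ".html" at the very end of the string.
END_ONLY = re.compile(r"se=((?:[\w.-]+)(?=\.html$)|[\w.-]+)")

url = "/scan/anything/se=hello-world.html/se=word.html"

print(ANYWHERE.findall(url))  # ['hello-world', 'word']
print(END_ONLY.findall(url))  # ['hello-world.html', 'word']
```

With `$` in the lookahead, the first alternative fails for the first parameter, so the plain `[\w.-]+` alternative takes over and keeps its .html.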

Edit 2:

In light of the information provided in the OP's comment below, this regex is more appropriate for URL redirection:

^\/scan\/anything\/se=([\w-.]+)\/se=((?:[\w-.]+)(?=\.html)|[\w-.]+)

See this demo.

This will capture both se parameters in $1 and $2 respectively for each URL while still matching the same inputs as the above regular expressions.
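
To see what $1 and $2 would hold, here is a small Python sketch of the same anchored pattern. The `/newurl/$1-$2` target is taken from the OP's comment; the function name `new_url` and the `[\w.-]` spelling of the class (needed for Python's `re` module) are assumptions made for the example, and this is not Apache configuration.

```python
import re

# Anchored pattern from Edit 2, with the class respelled for Python.
REDIRECT = re.compile(
    r"^/scan/anything/se=([\w.-]+)/se=((?:[\w.-]+)(?=\.html)|[\w.-]+)"
)


def new_url(url):
    """Build /newurl/<group 1>-<group 2>, or return None if url does not match."""
    match = REDIRECT.match(url)
    if match is None:
        return None
    return "/newurl/{}-{}".format(match.group(1), match.group(2))


print(new_url("/scan/anything/se=hello-world/se=word.html"))  # /newurl/hello-world-word
print(new_url("/scan/anything/se=hello-world/se=1.5/"))       # /newurl/hello-world-1.5
print(new_url("/scan/anything/se=temp-2.5/se=1.5.html"))      # /newurl/temp-2.5-1.5
```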

Paul-Etienne
  • Well I tried this in apache, I changed the regex start to "test" rather than "scan" as its a live site, so I tried this: `RewriteCond %{REQUEST_URI} ^/test/(.*)$ RewriteRule ^/test/.*?se=((?:[\w-.]+)(?=\.html)|[\w-.]+) /newurl/$1-$2 [R=301,L]` with url /test/se=word1/se=word2 I got /newurl/word1- – Andrew Smith Nov 10 '17 at 15:12
  • I think your question was lacking some context. Could you state more precisely the output you were expecting? Are you trying to capture the whole `se=hello-world/se=word` string, for example? – Paul-Etienne Nov 10 '17 at 15:17
  • I think I know now what you want exactly. Check out my update and tell me if it's doing the job. – Paul-Etienne Nov 10 '17 at 15:25
  • Sorry if it wasn't clear, /scan/se=word1/se=word2 should rewrite to /newurl/word1-word2. word1 and word2 can have hyphens or decimal points, so /se=temp1.5/se=hello-world should rewrite as /temp1.5-hello-world – Andrew Smith Nov 10 '17 at 15:48
  • It seems to work! Thanks. I also need a rule for 3 and 4 se= matches, is it best to use separate rewrite rules with separate regexes? – Andrew Smith Nov 10 '17 at 15:59
  • I don't have such a wide knowledge of Apache redirection rules, but if you don't have a way to dynamically call the matching groups ($1 to $4), then I guess your only choice is to write three separate rules. – Paul-Etienne Nov 10 '17 at 16:18
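
The "one regex per number of se= parameters" idea from the last two comments can be modelled in Python. This is a rough sketch and not Apache configuration: `build_pattern` and `rewrite` are hypothetical helpers, the `/scan/` prefix and `/newurl/` target follow the OP's comments, and the trailing `/?(?:\.html)?$` is an added assumption so that each pattern only matches URLs with exactly that many parameters.

```python
import re

PLAIN = r"se=([\w.-]+)"                        # every parameter except the last
LAST = r"se=((?:[\w.-]+)(?=\.html)|[\w.-]+)"   # last one, stopping before .html


def build_pattern(count):
    """Compile a pattern for a URL with exactly `count` se= parameters."""
    parts = [PLAIN] * (count - 1) + [LAST]
    return re.compile(r"^/scan/" + "/".join(parts) + r"/?(?:\.html)?$")


def rewrite(url, max_params=4):
    """Try the 4-, 3- and 2-parameter patterns in turn, like separate rules."""
    for count in range(max_params, 1, -1):
        match = build_pattern(count).match(url)
        if match:
            return "/newurl/" + "-".join(match.groups())
    return None


print(rewrite("/scan/se=word1/se=word2"))        # /newurl/word1-word2
print(rewrite("/scan/se=a/se=b/se=c.html"))      # /newurl/a-b-c
print(rewrite("/scan/se=a/se=b/se=c/se=1.5/"))   # /newurl/a-b-c-1.5
```

Each pattern corresponds to one rule, which matches the suggestion of writing three separate rules for two, three and four parameters.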