Regex to match decimal point but not .html

Question

I have URLs in this format:

/scan/anything/se=hello-world/se=word.html
/scan/anything/se=hello-world/se=1.5/
/scan/anything/se=temp-2.5/se=1.5.html

I'm trying to match and capture the word characters after each se=, including any dashes and decimal points.

The regex I have come up with is this:

  ^/scan/.*?se=([\w-.]*)/?(?:se=)([\w-.]*)/?(?:.html)?

Because I have added a dot (.) to the character class to match the decimal point, it also matches .html, so it captures word.html and 1.5.html rather than just "word" and "1.5" from URLs 1 and 3. How can I stop it matching .html? I've tried various negations but none seem to work.

Desired output:

hello-world and word
hello-world and 1.5
temp-2.5 and 1.5

This is slightly hacky, but you could just do a string replacement for .html and replace it with nothing. — Froopy, Nov 10 '17 at 13:41
Assuming that html extension appears at the end of URL only, you may go with sth like [**`^\/scan\/[^=]*se=([\w.-]+)\/se=((?:[\w.-](?!\w*$))+)`**](https://regex101.com/r/kMVn6W/1) — revo, Nov 10 '17 at 13:49
I don't know if that's a case you can encounter, but revo's regex wouldn't capture `1.5` in `/scan/anything/se=hello-world/se=1.5` (without a `/` at the end of the string). If that's a case you can encounter, please have a look at my suggestion which does capture `1.5` in this case. Plus it allows more `se` parameters than two (again, if that's a case you want to handle). — Paul-Etienne, Nov 10 '17 at 14:26

sniperd · Answer 1 · 2017-11-10T14:27:27.137

0

You want to use a negated character class combined with a positive lookahead. The lookahead checks for .html without making it part of the capturing group:

se=([^/]+)/se=((?:[^/]+)(?=\.html)|[^/]+)

That way you capture every non-/ character up to the next /, and the lookahead stops the second group just before .html when it is present.

Here is a little example in Python:

import re

thelist = [
"/scan/anything/se=hello-world/se=word.html",
"/scan/anything/se=hello-world/se=1.5/",
"/scan/anything/se=temp-2.5/se=1.5.html",
]

# Raw string so that \. reaches the regex engine unchanged.
regex = r"se=([^/]+)/se=((?:[^/]+)(?=\.html)|[^/]+)"

for item in thelist:
    thematch = re.search(regex, item)
    print(thematch.group(1))
    print(thematch.group(2))
    print("------------")

results:

hello-world
word
------------
hello-world
1.5
------------
temp-2.5
1.5
------------

http://regex101.com is a nice little site to play around with this kind of stuff if you need to tweak a regex

edited Nov 10 '17 at 14:27

answered Nov 10 '17 at 13:45


OP doesn't want to capture `.html` – Paul-Etienne Nov 10 '17 at 13:51
Thanks but need to negate the .html bit? – Andrew Smith Nov 10 '17 at 14:05
OK, updated finally. The .html now isn't included in the capture group. – sniperd Nov 10 '17 at 14:21

Paul-Etienne · Accepted Answer · 2017-11-10T15:24:38.713

0

I suggest this regex:

se=((?:[\w-.]+)(?=\.html)|[\w-.]+)

See this demo.

This will match any word that can contain - or . until a potential .html (it will stop right before .html if any).
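As a quick check of that behaviour, here is a minimal Python sketch run against the three URLs from the question. One assumption to flag: Python's `re` module rejects `[\w-.]` as a bad character range, so the class is written `[\w.-]` here, which matches the same characters.

```python
import re

# Same pattern as above, with the character class written as [\w.-]
# because Python's re module rejects [\w-.] as a bad range.
pattern = re.compile(r"se=((?:[\w.-]+)(?=\.html)|[\w.-]+)")

urls = [
    "/scan/anything/se=hello-world/se=word.html",
    "/scan/anything/se=hello-world/se=1.5/",
    "/scan/anything/se=temp-2.5/se=1.5.html",
]

# findall returns the single capture group for every se= parameter.
captured = [pattern.findall(url) for url in urls]
for url, values in zip(urls, captured):
    print(url, "->", values)
```

Because the pattern is not anchored, `findall` picks up however many `se=` parameters the URL contains, not just two.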

Edit :

The above regex won't capture .html even if it's inside the URL, like at the end of a parameter. For example, this is what would be captured in this case :

/scan/anything/se=hello-world.html/se=word.html
                  ^^^^^^^^^^^         ^^^^

So if you want to capture everything but the very last .html, you'd have to add the end-of-string anchor $ inside the lookahead:

se=((?:[\w-.]+)(?=\.html$)|[\w-.]+)

See this second demo.
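The difference between the two variants only shows up when .html appears before the final parameter. A small Python comparison, assuming the character class is written `[\w.-]` (Python's `re` does not accept `[\w-.]`):

```python
import re

# Variants without and with the end-of-string anchor inside the lookahead.
any_html = re.compile(r"se=((?:[\w.-]+)(?=\.html)|[\w.-]+)")
last_html = re.compile(r"se=((?:[\w.-]+)(?=\.html$)|[\w.-]+)")

url = "/scan/anything/se=hello-world.html/se=word.html"

without_anchor = any_html.findall(url)
with_anchor = last_html.findall(url)
print(without_anchor)  # ['hello-world', 'word']
print(with_anchor)     # ['hello-world.html', 'word']
```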

Edit 2 :

In light of the information in OP's comment below, this regex is more appropriate for URL redirection:

^\/scan\/anything\/se=([\w-.]+)\/se=((?:[\w-.]+)(?=\.html)|[\w-.]+)

See this demo.

This will capture both se parameters in $1 and $2 respectively for each URL while still matching the same inputs as the above regular expressions.
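To see what the two groups would produce in a redirect, the following Python sketch simulates the `/newurl/$1-$2` substitution from OP's comment. It is only an approximation: Apache evaluates the pattern with PCRE rather than Python's `re`, so the class is written `[\w.-]` here, and `rewrite` is a hypothetical helper name used purely for illustration.

```python
import re

# Anchored pattern from Edit 2, with [\w-.] rewritten as [\w.-] for Python.
pattern = re.compile(
    r"^/scan/anything/se=([\w.-]+)/se=((?:[\w.-]+)(?=\.html)|[\w.-]+)"
)

def rewrite(path):
    """Hypothetical helper: return /newurl/<group 1>-<group 2>, or None if no match."""
    match = pattern.search(path)
    if match is None:
        return None
    return "/newurl/{}-{}".format(match.group(1), match.group(2))

print(rewrite("/scan/anything/se=temp-2.5/se=1.5.html"))      # /newurl/temp-2.5-1.5
print(rewrite("/scan/anything/se=hello-world/se=word.html"))  # /newurl/hello-world-word
print(rewrite("/other/se=a/se=b"))                            # None
```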

edited Nov 10 '17 at 15:24

answered Nov 10 '17 at 14:07


Well I tried this in apache, I changed the regex start to "test" rather than "scan" as its a live site, so I tried this: `RewriteCond %{REQUEST_URI} ^/test/(.*)$ RewriteRule ^/test/.*?se=((?:[\w-.]+)(?=\.html)|[\w-.]+) /newurl/$1-$2 [R=301,L]` with url /test/se=word1/se=word2 I got /newurl/word1- – Andrew Smith Nov 10 '17 at 15:12
I think your question was lacking some context. Could you add more precisely the output you were expecting ? Are you trying to capture the whole `se=hello-world/se=word` string for example ? – Paul-Etienne Nov 10 '17 at 15:17
I think I know now what you want exactly. Check out my update and tell me if it's doing the job. – Paul-Etienne Nov 10 '17 at 15:25
Sorry if it wasn't clear: /scan/se=word1/se=word2 should rewrite to /newurl/word1-word2. word1 and word2 can have hyphens or decimal points, so /se=temp1.5/se=hello-world should rewrite as /temp1.5-hello-world – Andrew Smith Nov 10 '17 at 15:48
It seems to work! Thanks. I also need a rule for 3 and 4 se= matches, is it best to use separate rewrite rules with separate regexes? – Andrew Smith Nov 10 '17 at 15:59
I don't have such a wide knowledge of Apache redirection rules, but if you don't have a way to dynamically call the matching groups ($1 to $4), then I guess your only choice is to write three separate rules. – Paul-Etienne Nov 10 '17 at 16:18
