Regex to extract part of URL that can be in the middle or end

Question

So I have a URL that could be in the following three formats:

https://www.example.com/d/abcd-1234/edit
https://www.example.com/d/abcd-1234/
https://www.example.com/d/abcd-1234

I would like to extract only the abcd-1234 part from the above URLs. I've tried the following regular expression, but it only captures the first and second cases.

import re

id = re.search(r'https://www.example.com/d/(.+?)/', url).group(1)

For url=https://www.example.com/d/abcd-1234 the above will fail with:

AttributeError: 'NoneType' object has no attribute 'group'

given that it does not match the regex.

What regex shall I use in order to indicate that the part of interest could be followed by no character at all?

Given that you know it always start with `https://www.example.com/d/` (based on regex pattern) and you want just the next element of the url, why the regex, just split at `/` and take element with index 4 — buran, May 04 '23 at 18:06
This should work too: `re.search(r'https://www.example.com/d/([-\w]+)', url).group(1)` — Swifty, May 04 '23 at 18:10
@Questor, it's not much, and not better than Michael Cao's answer, so not really worthy of an answer ;) — Swifty, May 04 '23 at 18:47

score 0 · Accepted Answer · answered May 04 '23 at 18:03

Look for forward slash or end of string with $:

re.search(r"https://www.example.com/d/(.+?)(?:/|$)", url).group(1)

Michael Cao

Never use `(.+?)(?:/|$)`. Use `[^/]*` / `[^/]+`. Better performance and cross-platform support. Always check for a match before accessing the group value, or you risk getting an exception. – Wiktor Stribiżew May 05 '23 at 08:22
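
A minimal sketch of what this comment suggests, for illustration only. The names `ID_PATTERN` and `extract_id` are hypothetical, and escaping the dots in the host name is an added assumption that the original patterns do not make:

```python
import re

# [^/]+ matches one or more non-slash characters, so the capture stops at
# the next "/" or at the end of the string without needing (?:/|$).
# The dots are escaped so they match literal dots only.
ID_PATTERN = re.compile(r"https://www\.example\.com/d/([^/]+)")


def extract_id(url):
    """Return the path segment after /d/, or None if the URL does not match."""
    match = ID_PATTERN.search(url)
    # Check for a match first; calling .group() on None raises AttributeError.
    return match.group(1) if match else None


for url in (
    "https://www.example.com/d/abcd-1234/edit",
    "https://www.example.com/d/abcd-1234/",
    "https://www.example.com/d/abcd-1234",
):
    print(extract_id(url))
```

Returning None instead of raising is one possible design; the caller then decides how to handle a URL that does not fit the expected shape.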

score 0 · Answer 2 · answered May 04 '23 at 18:12

Your regex fails to match the third example URL because the pattern requires a trailing /, which that URL does not have. You need to adjust the pattern to allow for that case as well:

r'https://www.example.com/d/(.+?)(?:/|$)'

It looks almost like yours, except for the final non-capturing group (?:/|$), which has two alternatives separated by |: either a / character or the end of the string ($).

import re

urls = [
    'https://www.example.com/d/abcd-1234/edit',
    'https://www.example.com/d/abcd-1234/',
    'https://www.example.com/d/abcd-1234',
]

for url in urls:
    id = re.search(r'https://www.example.com/d/(.+?)(?:/|$)', url).group(1)
    print(f'{id} <- {url}')

This produces the expected output:

abcd-1234 <- https://www.example.com/d/abcd-1234/edit
abcd-1234 <- https://www.example.com/d/abcd-1234/
abcd-1234 <- https://www.example.com/d/abcd-1234

Alternatively, if the URL structure is fixed, you can split the string on / and take the element at index 4:

$ "https://www.example.com/d/abcd-1234/edit".split('/')
 0         1   2                  3    4
['https:', '', 'www.example.com', 'd', 'abcd-1234', 'edit']
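
As a rough sketch, the split approach could be wrapped in a small function that guards against URLs with too few segments. The name `id_by_split` and the check that index 3 equals 'd' are assumptions added for illustration, not part of the answer above:

```python
def id_by_split(url):
    """Return the element at index 4 after splitting on '/', or None."""
    parts = url.split("/")
    # For the example URLs, parts[3] is 'd' and parts[4] is the ID.
    # The length check avoids an IndexError on shorter URLs, and the
    # truthiness check rejects an empty segment such as in '.../d/'.
    if len(parts) > 4 and parts[3] == "d" and parts[4]:
        return parts[4]
    return None


print(id_by_split("https://www.example.com/d/abcd-1234/edit"))
print(id_by_split("https://www.example.com/d/abcd-1234"))
```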

score 0 · Answer 3 · answered May 04 '23 at 18:13

First

Have you heard of regex101.com? I highly recommend using it.

Second.

Let's talk about your regex.

(.+?)/:

(...) Creates a capture group
.+ Match one or more characters
.+? Match one or more characters as few times as possible (lazy capture).
/ Match the forward slash (/).

As it stands, your regex looks for a string of one or more characters that is followed by a forward slash.

To capture abcd-1234 in all three cases, change (.+?)/ to ([^/]+):

(...) Creates a capture group
[^/]+ Capture one or more characters that are not forward slashes.
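
To make the difference concrete, here is a small comparison of the two capture groups on the URL that has no trailing slash. The variable names are illustrative, and the dots are escaped here as an added assumption:

```python
import re

url = "https://www.example.com/d/abcd-1234"

# Original group: lazy match that must be followed by a literal "/".
lazy = re.search(r"https://www\.example\.com/d/(.+?)/", url)

# Suggested group: one or more characters that are not "/".
negated = re.search(r"https://www\.example\.com/d/([^/]+)", url)

print(lazy)  # no slash after the ID, so there is no match object
print(negated.group(1) if negated else None)
```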
