I have a URL that can be in any of the following three formats:

https://www.example.com/d/abcd-1234/edit
https://www.example.com/d/abcd-1234/
https://www.example.com/d/abcd-1234

I would like to extract only the abcd-1234 part from the above URLs. I tried the following regular expression, but it only captures the ID in the first and second cases.

import re

id = re.search(r'https://www.example.com/d/(.+?)/', url).group(1)

For url=https://www.example.com/d/abcd-1234 the above will fail with:

AttributeError: 'NoneType' object has no attribute 'group'

because that URL does not match the regex.

What regex should I use to indicate that the part of interest may be followed by nothing at all?

InSync
Tokyo
  • Since the URL always starts with `https://www.example.com/d/` (judging by your pattern) and you only want the next path element, why use a regex at all? Just split on `/` and take the element at index 4. – buran May 04 '23 at 18:06
  • This should work too: `re.search(r'https://www.example.com/d/([-\w]+)', url).group(1)` – Swifty May 04 '23 at 18:10
  • @Swifty, that should be an answer instead of a comment. – Questor May 04 '23 at 18:19
  • @Questor, it's not much, and not better than Michael Cao's answer, so not really worthy of an answer ;) – Swifty May 04 '23 at 18:47

3 Answers

Look for either a forward slash or the end of the string ($):

re.search(r"https://www.example.com/d/(.+?)(?:/|$)", url).group(1)
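
As a minimal sketch, the same pattern can be wrapped in a helper that returns None instead of raising when the URL does not match. The name `extract_id` is just a placeholder I made up, and I have escaped the dots in the host name so they only match literal dots.

```python
import re

# Same idea as the pattern above, with the dots in the host name escaped.
PATTERN = re.compile(r"https://www\.example\.com/d/(.+?)(?:/|$)")

def extract_id(url):
    """Return the path segment after /d/, or None if the URL does not match."""
    match = PATTERN.search(url)
    return match.group(1) if match else None

print(extract_id("https://www.example.com/d/abcd-1234/edit"))
print(extract_id("https://www.example.com/d/abcd-1234"))
print(extract_id("https://www.example.com/other"))
```

The first two calls should print abcd-1234; the last one prints None because there is no /d/ segment to match.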
Michael Cao
  • Never use `(.+?)(?:/|$)`. Use `[^/]*` / `[^/]+`. Better performance and cross-platform support. Always check for a match before accessing the group value, or you risk getting an exception. – Wiktor Stribiżew May 05 '23 at 08:22

Your regexp fails to match the third example URL because your pattern requires a trailing /, which that URL does not have. You need to adjust the regexp to account for that case too:

r'https://www.example.com/d/(.+?)(?:/|$)'

It looks almost like yours, except for the trailing non-capturing group (?:/|$), which has two alternatives: either a / character or (|) the end of the string ($).

import re

urls = [
    'https://www.example.com/d/abcd-1234/edit',
    'https://www.example.com/d/abcd-1234/',
    'https://www.example.com/d/abcd-1234',
]

for url in urls:
    id = re.search(r'https://www.example.com/d/(.+?)(?:/|$)', url).group(1)
    print(f'{id} <- {url}')

This produces the expected output:

abcd-1234 <- https://www.example.com/d/abcd-1234/edit
abcd-1234 <- https://www.example.com/d/abcd-1234/
abcd-1234 <- https://www.example.com/d/abcd-1234

Alternatively, if the URL structure is fixed, you can split the string on / and take the element at index 4:

$ "https://www.example.com/d/abcd-1234/edit".split('/')
 0         1   2                  3    4
['https:', '', 'www.example.com', 'd', 'abcd-1234', 'edit']
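
Here is a small sketch of the split approach, assuming every URL has the scheme, host and /d/ prefix shown in the question. The function name `id_from_split` is made up for illustration.

```python
def id_from_split(url):
    """Return the element at index 4 after splitting on '/', or None if it is missing or empty."""
    parts = url.split("/")
    # parts[0:4] are 'https:', '', 'www.example.com', 'd'
    return parts[4] if len(parts) > 4 and parts[4] else None

for url in (
    "https://www.example.com/d/abcd-1234/edit",
    "https://www.example.com/d/abcd-1234/",
    "https://www.example.com/d/abcd-1234",
):
    print(id_from_split(url))
```

Note that this does not validate the host or the /d/ prefix at all, so it is only safe when you already know the URL has that shape.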
Marcin Orlowski

First.

Have you heard of regex101.com? I highly recommend using it.

Second.

Let's talk about your regex.

(.+?)/:

  • (...) Creates a capture group
  • .+ Match one or more characters
  • .+? Match one or more characters as few times as possible (lazy capture).
  • / Match the forward slash (/).

As it stands, your regex looks for a string of one or more characters that ends in a forward slash, so a URL with nothing after the ID cannot match.

To capture your abcd-1234, change (.+?)/ to ([^/]+):

  • (...) Creates a capture group
  • [^/]+ Capture one or more characters that are not forward slashes.
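
To see the difference, here is a rough sketch that runs both patterns over the three URLs from the question. The variable and function names are mine, the dots in the host name are escaped, and the match is checked before calling group() so a miss gives None instead of an exception.

```python
import re

urls = [
    "https://www.example.com/d/abcd-1234/edit",
    "https://www.example.com/d/abcd-1234/",
    "https://www.example.com/d/abcd-1234",
]

# Original idea: lazy match that requires a trailing slash.
lazy = re.compile(r"https://www\.example\.com/d/(.+?)/")
# Suggested change: match everything up to the next slash or the end of the string.
negated = re.compile(r"https://www\.example\.com/d/([^/]+)")

def first_group(pattern, url):
    """Return group 1 of the first match, or None when there is no match."""
    match = pattern.search(url)
    return match.group(1) if match else None

results = [(first_group(lazy, u), first_group(negated, u)) for u in urls]
for url, (old, new) in zip(urls, results):
    print(f"{url}: lazy={old}, negated={new}")
```

Only the third URL should differ: the lazy pattern finds nothing there, while ([^/]+) still captures abcd-1234.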
Questor