3

I'm trying to write a regex that will capture the domain and path from a URL. I've tried:

https?:\/\/(.+)(\/.*)

That works fine for http://example.com/foo:

Match 1
0.  example.com
1.  /foo

But not what I would expect for http://example.com/foo/bar:

Expected:

Match 1
0.  example.com
1.  /foo/bar

Actual:

Match 1
0.  example.com/foo
1.  /bar

What am I doing wrong?

Sean W.
  • Is there any reason you want to do this with a regex? The [`urlparse`](http://docs.python.org/2/library/urlparse.html) module from the standard library does this and more. – Daniel Roseman Jan 31 '14 at 21:18
  • Related question that may help: http://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex – dcp Jan 31 '14 at 21:21
  • @DanielRoseman urlparse does a nice job of breaking up the URL, but I want the path including queries, parameters, and fragments. That will be useful for other cases. Thanks! – Sean W. Feb 03 '14 at 13:50

3 Answers

6

https?:\/\/(.+)(\/.*)

What am I doing wrong?

+ is greedy: .+ first runs to the end of the string and then backtracks only as far as the last slash. You should use it on [^/] instead of a dot.
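A minimal Python sketch of that difference, using the URL from the question. The variable names `greedy` and `fixed` are only illustrative, and the slashes are left unescaped because Python patterns are plain strings without delimiters:

```python
import re

# Pattern from the question: (.+) grabs as much as it can, so the
# engine gives back characters only until the LAST slash is found.
greedy = re.compile(r"https?://(.+)(/.*)")

# Same pattern with [^/]+ in place of .+ : the first group can no
# longer run past the first slash after the host.
fixed = re.compile(r"https?://([^/]+)(/.*)")

url = "http://example.com/foo/bar"

print(greedy.match(url).groups())  # ('example.com/foo', '/bar')
print(fixed.match(url).groups())   # ('example.com', '/foo/bar')
```

Both patterns still require at least one slash after the host, so neither matches a bare http://example.com.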

Also note that your “path” part will also contain the query string and fragment (hash).

This one gets just the domain (+ login, password, port) and path (without query string or fragment).

^https?://([^/]+)(/[^?#]*)?

I leave escaping the slashes accordingly up to you.
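As a rough illustration, here is how that pattern might be used from Python. The helper name `split_url` and the sample URLs are assumptions made for this sketch, not part of the answer:

```python
import re

# The pattern from this answer. No escaping of "/" is needed in Python.
URL_RE = re.compile(r"^https?://([^/]+)(/[^?#]*)?")


def split_url(url):
    """Return (authority, path), or None if the URL does not match.

    path is None when the URL has no path component at all.
    """
    m = URL_RE.match(url)
    if m is None:
        return None
    return m.group(1), m.group(2)


print(split_url("http://example.com/foo/bar"))          # ('example.com', '/foo/bar')
print(split_url("http://example.com/foo/bar?x=1#top"))  # ('example.com', '/foo/bar')
print(split_url("http://example.com"))                  # ('example.com', None)
```

If the query string and fragment should stay attached to the path, replacing the second group with `(/.*)?` is one way to get that.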

Caveat: This expects a valid URI; for valid input it correctly extracts the authority and path sections. If you want to parse a URI according to the standard, you need to implement the whole grammar or use the official regex from Appendix B of RFC 2396.

The following line is the regular expression for breaking-down a URI reference into its components.

   ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
    12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

   http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

   $1 = http:
   $2 = http
   $3 = //www.ics.uci.edu
   $4 = www.ics.uci.edu
   $5 = /pub/ietf/uri/
   $6 = <undefined>
   $7 = <undefined>
   $8 = #Related
   $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the four components and fragment as

   scheme    = $2
   authority = $4
   path      = $5
   query     = $7
   fragment  = $9
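
The RFC expression can be tried directly in Python. This is only a sketch: the function name `split_uri` and the dictionary layout are choices made here, not something the RFC defines. Groups that do not participate in the match come back as `None`, which plays the role of `<undefined>`:

```python
import re

# Regular expression from the RFC, copied unchanged.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Map the RFC's numbered groups to named components."""
    # Every part of the pattern is optional, so it matches any string.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts["authority"])  # www.ics.uci.edu
print(parts["path"])       # /pub/ietf/uri/
print(parts["query"])      # None
```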
Palec
  • You don't need the `\/` in `\/[^?]*`, because that class will match the first `/` past the domain. If you require it, the regex will fail when the slash is not there in the string. –  Jan 31 '14 at 21:41
  • @sln Not long ago (several weeks) I looked it up in the RFC. The slash after domain (and optional port number) is obligatory. If any URL does not have it… well, it is not a URL. If you want to be as forgiving as possible, change it to `(\/[^?]*)?`. – Palec Jan 31 '14 at 21:44
  • It's a stickler. With `^https?://([^/]+)([^?]*)`, the first character of capture group 2 has no choice but to be a `/`; otherwise capture group 1 will have captured to the end of the string, leaving capture group 2 empty. I agree with you, but why fail the regex when you can check the length of group 2 and still get some info on the domain? –  Feb 01 '14 at 21:16
  • @sln True in this case as the regex engine never needs to backtrack. But still it is more clear to write the slash there. I think it is not immediately obvious that backtracking cannot be done and in case it was done, the first group would not need to end immediately before a slash. [Possessive](http://www.regular-expressions.info/possessive.html) `+` would be needed (`++`). – Palec Feb 01 '14 at 21:42
  • @sln I read [RFC 1738](http://www.ietf.org/rfc/rfc1738.txt) on 2014-01-04 and now I realize I forgot what it says. §3.1 and more explicitly §3.3 say that the slash between port and path is not part of path and it is required if and only if something follows (path or query string). Fragment (hash) is not a part of URL, the standard does not speak about it. It is part of URI reference, defined in URI standard, [RFC 2396](http://www.ietf.org/rfc/rfc2396.txt). Reading its §3 now, I realize that a few things changed. The slash is now part of path and may be omitted even when query string is not. – Palec Feb 02 '14 at 00:37
  • If the '/' is left in, the regex could fail when it is not there; if it is optional and not there, Grp 2 will be empty. Those are the only two options. In this case, [^?]* is identical to /[^?]* if both expressions are optional. And it makes the eye look twice. –  Feb 02 '14 at 00:37
  • @sln Did not get your comment. Updated my answer to correctly parse things like `http://example.com`. If you are trying to tell me that `^https?://([^/]+)([^?]*)` is better than `^https?://([^/]+)(/[^?]*)?`, I disagree. – Palec Feb 02 '14 at 00:56
  • They are identical. [^?]* matches /. To use both is redundant. In this regex, / is guaranteed to be the first character matched. When it's literal in the regex before the class, it gives the illusion the first character could be something different, making the reader wonder if something else was meant by the author. –  Feb 02 '14 at 22:07
  • @sln This works kind of like an assertion in a programming language. Although you think that a condition is always satisfied, you still test whether it is, just to be sure. If you ever change the code later, it may break terribly if the assertion is not satisfied. Assertions also help you see which invariants hold. Although you can often guess an invariant, it might take you some time to prove it. The assertion does it for you. Slight redundancy is not bad. – Palec Feb 03 '14 at 03:56
6

As noted, this is a non-greedy version: https?:\/\/(.+?)(\/.*)
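
A quick way to check the lazy quantifier in Python (a sketch; the name `lazy` and the test URLs are just for illustration). The `(.+?)` group grows one character at a time and stops at the first slash, but the pattern still needs a slash after the host to match at all:

```python
import re

# Lazy quantifier: (.+?) takes as little as possible, so it stops
# at the FIRST slash after the scheme.
lazy = re.compile(r"https?://(.+?)(/.*)")

print(lazy.match("http://example.com/foo/bar").groups())
# ('example.com', '/foo/bar')

# Without any slash after the host there is nothing for (/.*) to
# match, so the whole pattern fails.
print(lazy.match("http://example.com"))
# None
```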

GabiMe
0

Something like this 'greedy' version might work. I don't know if Python requires delimiters, so this is just the raw regex.

 #   https?://([^/]+)(.*)

 https?://
 ( [^/]+ )           # (1)
 ( .* )              # (2)
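
On the delimiter question: Python passes patterns as ordinary strings, so no delimiters are needed and `/` does not have to be escaped. The expanded layout above can be kept nearly verbatim with `re.VERBOSE`, which ignores whitespace in the pattern and treats `#` as the start of a comment. A sketch, with `pattern` as an illustrative name:

```python
import re

# re.VERBOSE lets the pattern be spread over several commented lines.
pattern = re.compile(
    r"""
    https?://
    ( [^/]+ )           # (1) everything up to the first slash
    ( .* )              # (2) the rest: path, query string, fragment
    """,
    re.VERBOSE,
)

print(pattern.match("http://example.com/foo/bar?x=1").groups())
# ('example.com', '/foo/bar?x=1')

# Group 2 may be empty, so a URL without a path still matches.
print(pattern.match("http://example.com").groups())
# ('example.com', '')
```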