i'm looking for a .NET Framework class that can parse URLs.

Some examples of URLs that require parsing:

  • server:8088
  • server:8088/func1
  • server:8088/func1/SubFunc1
  • http://server
  • http://server/func1
  • http://server/func/SubFunc1
  • http://server:8088
  • http://server:8088/func1
  • http://server:8088/func1/SubFunc1
  • magnet://server
  • magnet://server/func1
  • magnet://server/func/SubFunc1
  • magnet://server:8088
  • magnet://server:8088/func1
  • magnet://server:8088/func1/SubFunc1

The problem is that the Uri and UriBuilder classes do not handle the URLs correctly. For example, they are confused by:

stackoverflow.com:8088
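
To make the confusion concrete, here is a minimal sketch that asks System.Uri what it sees in that string. The class and method names (UriConfusionDemo, InferredScheme) are made up for illustration. Going by the results table further down, the text before the colon ends up as the scheme rather than the host.

```csharp
using System;

static class UriConfusionDemo
{
    // Hypothetical helper: returns whatever scheme System.Uri infers for the input.
    public static string InferredScheme(string input)
    {
        var uri = new Uri(input);
        return uri.Scheme;
    }

    public static void Main()
    {
        // No "//" follows the colon, so "stackoverflow.com" is read as a scheme
        // and "8088" as the path, instead of host and port.
        var uri = new Uri("stackoverflow.com:8088");
        Console.WriteLine("Scheme: " + uri.Scheme);
        Console.WriteLine("Host:   " + uri.Host);
        Console.WriteLine("Port:   " + uri.Port);
        Console.WriteLine("Path:   " + uri.AbsolutePath);
    }
}
```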

Background on Urls

The format of a Url is:

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \_________/ \__/\_________/\__________/ \__/
   |         |        |     |           |        |
scheme      host    port   path       query   fragment

In our case, we only care about:

  • Uri.Scheme
  • Uri.Host
  • Uri.Port
  • Uri.Path

Tests

Running some tests, we can check how the UriBuilder class handles various URIs:

//                                      Expected  Expected Expected    Expected
//Test URI                               Scheme    Server    Port        Path
//=====================================  ========  ========  ====  ====================
t("server",                              "",       "server", -1,   "");
t("server/func1",                        "",       "server", -1,   "/func1");
t("server/func1/SubFunc1",               "",       "server", -1,   "/func1/SubFunc1");
t("server:8088",                         "",       "server", 8088, "");
t("server:8088/func1",                   "",       "server", 8088, "/func1");
t("server:8088/func1/SubFunc1",          "",       "server", 8088, "/func1/SubFunc1");
t("http://server",                       "http",   "server", -1,   "");
t("http://server/func1",                 "http",   "server", -1,   "/func1");
t("http://server/func/SubFunc1",         "http",   "server", -1,   "/func/SubFunc1");
t("http://server:8088",                  "http",   "server", 8088, "");
t("http://server:8088/func1",            "http",   "server", 8088, "/func1");
t("http://server:8088/func1/SubFunc1",   "http",   "server", 8088, "/func1/SubFunc1");
t("magnet://server",                     "magnet", "server", -1,   "");
t("magnet://server/func1",               "magnet", "server", -1,   "/func1");
t("magnet://server/func/SubFunc1",       "magnet", "server", -1,   "/func/SubFunc1");
t("magnet://server:8088",                "magnet", "server", 8088, "");
t("magnet://server:8088/func1",          "magnet", "server", 8088, "/func1");
t("magnet://server:8088/func1/SubFunc1", "magnet", "server", 8088, "/func1/SubFunc1");

All but six cases fail to parse correctly:

Url                                  Scheme  Host    Port  Path
===================================  ======  ======  ====  ===============
server                               http    server  80    /
server/func1                         http    server  80    /func1
server/func1/SubFunc1                http    server  80    /func1/SubFunc1
server:8088                          server          -1    8088
server:8088/func1                    server          -1    8088/func1
server:8088/func1/SubFunc1           server          -1    8088/func1/SubFunc1
http://server                        http    server  80    /
http://server/func1                  http    server  80    /func1
http://server/func/SubFunc1          http    server  80    /func/SubFunc1
http://server:8088                   http    server  8088  /
http://server:8088/func1             http    server  8088  /func1
http://server:8088/func1/SubFunc1    http    server  8088  /func1/SubFunc1
magnet://server                      magnet  server  -1    /
magnet://server/func1                magnet  server  -1    /func1
magnet://server/func/SubFunc1        magnet  server  -1    /func/SubFunc1
magnet://server:8088                 magnet  server  8088  /
magnet://server:8088/func1           magnet  server  8088  /func1
magnet://server:8088/func1/SubFunc1  magnet  server  8088  /func1/SubFunc1
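
For anyone who wants to reproduce a table like the one above, a rough sketch of a probe follows. The names UriBuilderProbe and Describe are hypothetical (the original t(...) helper is not shown in the question); the only real API used is the UriBuilder(string) constructor and its Scheme, Host, Port, and Path properties. Results for scheme-less inputs may differ between .NET versions.

```csharp
using System;

static class UriBuilderProbe
{
    // Hypothetical helper: reports what UriBuilder sees as "scheme|host|port|path".
    public static string Describe(string url)
    {
        var b = new UriBuilder(url);
        return string.Join("|", b.Scheme, b.Host, b.Port.ToString(), b.Path);
    }

    public static void Main()
    {
        // A few of the inputs from the question; add the rest as needed.
        string[] inputs =
        {
            "server",
            "server:8088/func1",
            "http://server:8088/func1",
            "magnet://server:8088/func1/SubFunc1",
        };

        foreach (string input in inputs)
        {
            Console.WriteLine(input + " -> " + Describe(input));
        }
    }
}
```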

i said i wanted a .NET Framework class. i would also accept any code-gum lying around that i can pick up and chew, as long as it satisfies my simplistic test cases.

Bonus Chatter

i was looking at expanding this question, but that question is limited to http only.

i also asked this same question earlier today, but i realize now that i phrased it incorrectly. i incorrectly asked how to "build" a url. When in reality i want to "parse" a user-entered URL. i can't go back and fundamentally change the title now. So i'll ask the same question again, only better, with more clearly stated goals, here.

Bonus Reading

Ian Boyd
  • Based on what I see in [RFC 3986](http://tools.ietf.org/html/rfc3986#section-3.1) the scheme portion of URIs is mandatory. What would your app do when a user didn't enter the scheme? – PrimeNerd Nov 24 '13 at 02:29
  • @AndyB It would assume one appropriate for the application (*e.g. `stratum+udp://`*). For example, when i type a URL into the address bar (`stackoverflow.com:8088`) the browser is able to parse it, realize a scheme is missing, assume you meant `http`, and add it. **tl;dr**: i want to do what Chrome, Firefox, Safari, Internet Explorer, curl, and wget do. – Ian Boyd Nov 28 '13 at 20:37
  • Is there something wrong with my answer? :) – Luaan Jan 10 '14 at 14:58
  • @Luaan Sorry, i know you put in the work. But i was hoping to find the .NET class that exists for this purpose. i've been doing this long enough to know that i could never write a regex that handles all the cases i encounter in the wild. As you say, it's not perfect. And a search of SO will find a few dozen different expressions that try to do the same thing; with differing levels of success. – Ian Boyd Jan 10 '14 at 15:25
  • Yeah, it's pretty much impossible to parse the URL perfectly. After all, if you have an IPv6 address instead of host name, suddenly colon is a valid character in the hostname itself. And handling all the possibilities quickly spirals out of control. And then you include unicode domains, and escaping, and... I expect that each browser does the heuristics a tiny bit differently too. – Luaan Jan 10 '14 at 15:46
  • The question you linked still provides a useful answer because the Uri class isn't limited to HTTP. – Casey Oct 02 '14 at 19:08
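
The "assume a scheme when none was typed" idea from the comments can be sketched roughly as below. This is only one possible heuristic, not how any particular browser does it: it assumes that an input without "://" has no scheme, which misreads forms like mailto:someone. LenientUrl, Parse, and PortOrMinusOne are hypothetical names; Uri.IsDefaultPort is a real property, used here to report -1 when the port is just the scheme's default (an explicit :80 on http would also come back as -1).

```csharp
using System;

static class LenientUrl
{
    // Assumption: input without "://" is scheme-less, so a default scheme is prepended.
    public static Uri Parse(string input, string defaultScheme)
    {
        string candidate = input.Contains("://")
            ? input
            : defaultScheme + "://" + input;
        return new Uri(candidate);
    }

    // Reports -1 instead of the scheme's default port, matching the question's expectations.
    public static int PortOrMinusOne(Uri uri)
    {
        return uri.IsDefaultPort ? -1 : uri.Port;
    }

    public static void Main()
    {
        Uri uri = Parse("stackoverflow.com:8088", "http");
        Console.WriteLine(uri.Scheme + " " + uri.Host + " " + PortOrMinusOne(uri) + " " + uri.AbsolutePath);
    }
}
```

Note that the path of a host-only URL still comes back as "/" rather than the empty string, so that part of the test table would need its own adjustment.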

1 Answer

Will this regular expression do?

^((?<schema>[a-z]*)://)?(?<host>[^/:]*)?(:(?<port>[0-9]*))?(?<path>/.*)?$

It's not perfect, but it seems to work for your test cases.
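
A minimal sketch of how the expression might be wired up in C#, in case it helps. RegexUrlParser and TryParse are made-up names, and the pattern is copied as-is (including the "schema" group name and the lowercase-only scheme match). A missing scheme or path comes back as an empty string and a missing port as -1, following the conventions of the question's test table.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexUrlParser
{
    // The pattern from the answer, unchanged.
    static readonly Regex Pattern = new Regex(
        @"^((?<schema>[a-z]*)://)?(?<host>[^/:]*)?(:(?<port>[0-9]*))?(?<path>/.*)?$");

    // Hypothetical wrapper: returns false if the input does not match
    // or the port digits do not fit in an int.
    public static bool TryParse(string input, out string scheme, out string host,
                                out int port, out string path)
    {
        scheme = ""; host = ""; port = -1; path = "";

        Match m = Pattern.Match(input);
        if (!m.Success) return false;

        // Groups that did not participate in the match have an empty Value.
        scheme = m.Groups["schema"].Value;
        host = m.Groups["host"].Value;
        path = m.Groups["path"].Value;

        string portText = m.Groups["port"].Value;
        if (portText.Length > 0 && !int.TryParse(portText, out port))
        {
            port = -1;
            return false;
        }
        return true;
    }

    public static void Main()
    {
        string scheme, host, path;
        int port;
        if (TryParse("server:8088/func1", out scheme, out host, out port, out path))
        {
            Console.WriteLine("'" + scheme + "' '" + host + "' " + port + " '" + path + "'");
        }
    }
}
```

As the comments under the question point out, a pattern like this does not cover IPv6 literals, Unicode host names, or escaping.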

Luaan