38

I have a URL like this:

http://192.168.0.1:8080/servlet/rece

I want to parse the URL to get the values:

IP: 192.168.0.1
Port: 8080
page:  /servlet/rece

How do I do that?

Mike Woodhouse
Jiang Bian

10 Answers

29

Personally, I steal the HTParse.c module from the W3C (it is used in the lynx Web browser, for instance). Then, you can do things like:

 strncpy(hostname, HTParse(url, "", PARSE_HOST), size);

The important thing about using a well-established and debugged library is that you do not fall into the typical traps of URL parsing (many regexps fail when the host is an IP address, for instance, especially an IPv6 one).

John Auld
bortzmeyer
  • 1
    In particular, be aware that with IPv6 there are ambiguous cases if you try to use the colon as a separator, e.g. 3ffe:0501::1:2: is that a port of 2, or a full address with your default port? The URL specs have dealt with this, as have the prewritten libraries. – bitmusher Jan 18 '13 at 21:53
  • 3
    Note there is no real ambiguity. The URI standard, RFC 3986, is clear and your example is illegal (you need square brackets). – bortzmeyer Jan 31 '13 at 09:35
  • 3
    Thanks, this is comforting. I was under the mistaken impression that user facing code, like browser address bars, was accepting the addresses without square brackets. A quick tour of some popular browsers reveals this is not the case. – bitmusher Feb 02 '13 at 18:03
  • 1
    `HTParse.c` has a number of dependencies, any chance you can explain how you can "steal" this from the project easily? Maybe back in 2009 it did not ;) – Carson Reinke Jan 30 '19 at 18:38
15

I wrote some simple code using sscanf that can parse very basic URLs.

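#include <stdio.h>

int main(void)
{
    const char text[] = "http://192.168.0.2:8888/servlet/rece";
    char ip[100];
    int port = 80;
    char page[100];

    /* sscanf returns the number of fields stored; all three must match.
       The '/' in the format is consumed, so page has no leading slash. */
    if (sscanf(text, "http://%99[^:]:%d/%99[^\n]", ip, &port, page) != 3) {
        fprintf(stderr, "could not parse \"%s\"\n", text);
        return 1;
    }
    printf("ip = \"%s\"\n", ip);
    printf("port = \"%d\"\n", port);
    printf("page = \"%s\"\n", page);
    return 0;
}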

./urlparse
ip = "192.168.0.2"
port = "8888"
page = "servlet/rece"
Community
Jiang Bian
10

This may be late, but what I have used is the http_parser_parse_url() function, together with the macros it requires, separated out from the Joyent HTTP parser library. It worked well and comes to about 600 lines of code.

Ani
  • Yep. The node.js HTTP parser lib is great and very well tested for anything that has to do with HTTP requests / responses. – Jan Jongboom Feb 15 '17 at 11:35
10

Use a regular expression if you want the easy way. Otherwise use Flex/Bison.

You could also use a URI parsing library.

rici
dsm
  • 1
    Indeed, using a library seems the only reasonable thing, since there are many traps (http vs. https, explicit port, encoding in the path, etc). – bortzmeyer Apr 07 '09 at 17:05
  • Hi, I wrote a BNF for URLs like this: URL = "http://" {IP} {PORT}? {PAGE}? Flex generated a file that parses the URL, but how do I fetch the individual parts (IP, PORT and PAGE) from the URL? – Hemant Patel Jul 07 '16 at 06:58
3

libcurl now has a curl_url_get() function that can extract the host, path, etc.

Example code: https://curl.haxx.se/libcurl/c/parseurl.html

/* extract host name from the parsed URL */ 
uc = curl_url_get(h, CURLUPART_HOST, &host, 0);
if(!uc) {
  printf("Host name: %s\n", host);
  curl_free(host);
}
JohnMudd
2

This one is small and worked very well for me: http://draft.scyphus.co.jp/lang/c/url_parser.html. It is just two files (*.c, *.h).
I had to adapt the code [1].

[1] Change all the function calls from http_parsed_url_free(purl) to parsed_url_free(purl):

   //Rename the function called
   //http_parsed_url_free(purl);
   parsed_url_free(purl);
tremendows
2

Pure sscanf() based solution:

//Code
#include <stdio.h>

int
main (int argc, char *argv[])
{
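    const char *uri = "http://192.168.0.1:8080/servlet/rece";
    char ip_addr[16], path[100];   /* a dotted-quad IPv4 address needs at most 15 chars + NUL */
    int port;

    /* the %15 and %99 field widths keep sscanf from writing past the buffers */
    int uri_scan_status = sscanf(uri, "%*[^:]%*[:/]%15[^:]:%d%99s", ip_addr, &port, path);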
    
    printf("[info] URI scan status : %d\n", uri_scan_status);
    if( uri_scan_status == 3 )
    {   
        printf("[info] IP Address : '%s'\n", ip_addr);
        printf("[info] Port: '%d'\n", port);
        printf("[info] Path : '%s'\n", path);
    }
    
    return 0;
}

However, keep in mind that this solution is tailor-made for URIs of the form [protocol_name]://[ip_address]:[port][/path]. To understand more about the components of URI syntax, you can head over to RFC 3986.

Now let's break down our tailor-made format string, shown here without field widths: "%*[^:]%*[:/]%[^:]:%d%s"

  • %*[^:] skips the protocol/scheme (e.g. http, https, ftp).

    It matches everything from the beginning of the string up to, but not including, the first : character. Because of the * right after the % character, the matched text is discarded instead of being stored.

  • %*[:/] skips the separator that sits between the protocol and the IP address, i.e. ://

  • %[^:] captures everything after the separator up to the next :. This captured string is the IP address.

  • :%d matches the : that ended the IP address and then captures the number right after it, which is the port number.

  • %s captures the rest of the string up to the first whitespace character, which is the path of the resource you are looking for.

Community
Argon
1

This C gist could be useful. It implements a pure C solution with sscanf.

https://github.com/luismartingil/per.scripts/tree/master/c_parse_http_url

It uses

// Parsing the tmp_source char*
if (sscanf(tmp_source, "http://%99[^:]:%i/%199[^\n]", ip, &port, page) == 3) { succ_parsing = 1;}
else if (sscanf(tmp_source, "http://%99[^/]/%199[^\n]", ip, page) == 2) { succ_parsing = 1;}
else if (sscanf(tmp_source, "http://%99[^:]:%i[^\n]", ip, &port) == 2) { succ_parsing = 1;}
else if (sscanf(tmp_source, "http://%99[^\n]", ip) == 1) { succ_parsing = 1;}
(...)
luismartingil
  • The third if statement will never be tested, because the second one has the same meaning, so this could cause a problem with port/page. – Risinek Apr 01 '16 at 13:41
1

I wrote this

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
typedef struct
{
    /* C does not allow default member initializers; split_url() sets every field */
    const char* protocol;
    const char* site;
    const char* port;
    const char* path;
} URL_INFO;
URL_INFO* split_url(URL_INFO* info, const char* url)
{
    if (!info || !url)
        return NULL;
    info->protocol = strtok(strcpy((char*)malloc(strlen(url)+1), url), "://");
    info->site = strstr(url, "://");
    if (info->site)
    {
        info->site += 3;
        char* site_port_path = strcpy((char*)calloc(1, strlen(info->site) + 1), info->site);
        info->site = strtok(site_port_path, ":");
        info->site = strtok(site_port_path, "/");
    }
    else
    {
        char* site_port_path = strcpy((char*)calloc(1, strlen(url) + 1), url);
        info->site = strtok(site_port_path, ":");
        info->site = strtok(site_port_path, "/");
    }
    char* URL = strcpy((char*)malloc(strlen(url) + 1), url);
    info->port = strstr(URL + 6, ":");
    char* port_path = 0;
    char* port_path_copy = 0;
    if (info->port && isdigit(*(port_path = (char*)info->port + 1)))
    {
        port_path_copy = strcpy((char*)malloc(strlen(port_path) + 1), port_path);
        char * r = strtok(port_path, "/");
        if (r)
            info->port = r;
        else
            info->port = port_path;
    }
    else
        info->port = "80";
    if (port_path_copy)
        info->path = port_path_copy + strlen(info->port ? info->port : "");
    else 
    {
        char* path = strstr(URL + 8, "/");
        info->path = path ? path : "/";
    }
    int r = strcmp(info->protocol, info->site) == 0;
    if (r && strcmp(info->port, "80") == 0)  /* compare contents, not pointers */
        info->protocol = "http";
    else if (r)
        info->protocol = "tcp";
    return info;
}

Test

int main()
{
    URL_INFO info;
    split_url(&info, "ftp://192.168.0.1:8080/servlet/rece");
    printf("Protocol: %s\nSite: %s\nPort: %s\nPath: %s\n", info.protocol, info.site, info.port, info.path);
    return 0;
}

Out

Protocol: ftp
Site: 192.168.0.1
Port: 8080
Path: /servlet/rece
Beyondo
-3

Write a custom parser or use one of the string replace functions to replace the separator ':' and then use sscanf().

dirkgently
  • 22
    There are many traps to watch so a custom parser seems to me a bad idea. – bortzmeyer Apr 07 '09 at 16:53
  • 1
    @bortzmeyer: that doesn't make the suggestion invalid. It's vague reasoning. Also, a custom parser is the most powerful, efficient, and dependency-free option. The sscanf approach is easier to get wrong. – dirkgently Apr 07 '09 at 17:00
  • 30
    how is "write some code that does what you need" an accepted answer? – Spike0xff Aug 21 '16 at 03:47