
I have a huge list of triples like this:

?s ex:url ?url

Where ?url can be:

www.ex.com/data/1.html
www.ex.com/data/2.html
www.google.com/search
...

Is it possible, with a SPARQL query, to filter the query somehow and obtain the distinct list of domains? In the example, www.ex.com and www.google.com.

Something like this:

SELECT distinct ?url
WHERE { ?s ex:url ?url }

But with each ?url binding reduced to its domain. Of course I could fetch them all and process each URL one by one in my program, but I suppose a SPARQL query would be more memory efficient. I am using Stardog, in case it has some custom functionality.

user1156544
  • You could BIND the domain of a URL to a new variable and SELECT this variable then. See the SPARQL specs for String operations and REGEX. Something like the substring until the first occurrence of `/` should work. – UninformedUser Oct 21 '16 at 17:28
  • Can you please elaborate on how I could bind the domain of the URLs to a new variable? I know the REGEX operations, but they seem to discard results via FILTER – user1156544 Oct 21 '16 at 17:36

2 Answers


You can do something like this using string manipulation that doesn't require regular expressions. E.g., you can take the part of the string form of the URL after a "//" and before a "/":

select ?url ?hostname {
  values ?url { <http://example.org/index.html> }
  bind(strbefore(strafter(str(?url),"//"),"/") as ?hostname)
}
---------------------------------------------------
| url                             | hostname      |
===================================================
| <http://example.org/index.html> | "example.org" |
---------------------------------------------------

Because this avoids regular expressions, it might be faster than a solution using the regex function.

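However, this can still return more than a hostname. For a URL like http://username:password@example.org:8080/index.html, you'd get username:password@example.org:8080, which includes the user info and port as well as the hostname. Note also that strbefore returns an empty string when its second argument does not occur, so a URL with no "/" after the authority (e.g., http://example.org) yields "" rather than the hostname.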
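To see how the nested strafter/strbefore call behaves on different inputs, here is a rough Python sketch. The helpers `strafter`, `strbefore`, and `authority_like` are hypothetical stand-ins that imitate the SPARQL functions for non-empty separators; they are not part of any library, and the sample URLs are only illustrations.

```python
def strafter(text: str, sep: str) -> str:
    """Imitate SPARQL STRAFTER: text after the first sep, or "" if sep is absent."""
    _, found, tail = text.partition(sep)
    return tail if found else ""


def strbefore(text: str, sep: str) -> str:
    """Imitate SPARQL STRBEFORE: text before the first sep, or "" if sep is absent."""
    head, found, _ = text.partition(sep)
    return head if found else ""


def authority_like(url: str) -> str:
    # Same nesting as strbefore(strafter(str(?url), "//"), "/") in the query.
    return strbefore(strafter(url, "//"), "/")


samples = [
    "http://example.org/index.html",
    "http://username:password@example.org:8080/index.html",
    "http://example.org",       # no "/" after the authority
    "www.ex.com/data/1.html",   # no scheme, so no "//"
]

for url in samples:
    print(repr(url), "->", repr(authority_like(url)))
```

The last two samples come back as empty strings, because one of the two separators is missing in each of them.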

To do this more carefully, you'd want to pick one of the URI/URL, etc., specifications, such as RFC 3986, and have a look at the section on syntax components. A few relevant productions from that grammar are:

URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

hier-part   = "//" authority path-abempty
            / path-absolute
            / path-rootless
            / path-empty

The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.

authority   = [ userinfo "@" ] host [ ":" port ]

I won't work through all that (and maybe it would make more sense to use a regular expression to handle the complex cases), but it might be easiest to just take the URI from the SPARQL result and then use an actual URI parsing library to get the hostname. That's the most reliable solution, since URIs can be pretty complex.
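As one possible shape for the library route, here is a minimal Python sketch using the standard library's `urllib.parse.urlsplit`. The function names `hostname_of` and `distinct_hostnames` are made up for illustration, and the `"//"` prefixing of scheme-less values is an assumption about how data like that in the question should be read, not something the URI specification prescribes.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


def hostname_of(url: str) -> Optional[str]:
    """Return the host part of url, or None if urlsplit finds no host."""
    # urlsplit only recognises an authority after "//", so scheme-less
    # values such as "www.ex.com/data/1.html" get a "//" prefix first.
    # (Assumption: such values start with the host name.)
    if not url.startswith("//") and "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname


def distinct_hostnames(urls: Iterable[str]) -> List[str]:
    """Collect the distinct hostnames of all URLs, sorted for stable output."""
    hosts = {hostname_of(u) for u in urls}
    hosts.discard(None)  # drop URLs without a recognisable host
    return sorted(hosts)


urls = [
    "www.ex.com/data/1.html",
    "www.ex.com/data/2.html",
    "www.google.com/search",
    "http://username:password@example.org:8080",
]
print(distinct_hostnames(urls))
```

Here the `urls` list stands in for the `?url` values returned by the SPARQL query. `urlsplit(...).hostname` drops the user info and port and lower-cases the host.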

Joshua Taylor
  • STRBEFORE+STRAFTER does not work if the URL is not complete (missing scheme), as in the examples in the question. Although apparently this was not the case in the actual data. – evsheino Oct 21 '16 at 22:12
  • I agree with your last paragraph. Using a URI parsing library sounds like the most robust way to program this. I will check your solution too and measure all times to see which of the 3 options performs better. Thanks – user1156544 Oct 21 '16 at 22:35

Use REPLACE with REGEX:

BIND(REPLACE(STR(?url), "^(.*?)/.*", "$1") AS ?domain)

Example in Yasgui

Edit: As @JoshuaTaylor noted in the comments, STRBEFORE is better if there is no scheme in ?url:

BIND(STRBEFORE(STR(?url), "/") AS ?domain)

If you need to worry about the URL scheme (this discards the scheme):

BIND(REPLACE(STR(?url), "^(https?://)?(.*?)/.*", "$2") AS ?domain)

Of course, the above only works for basic http(s) URLs, and the regex becomes somewhat more complex if arbitrary URLs need to be handled.

Here's one that handles any scheme (or none), a port number, auth info, and a missing trailing slash:

BIND(REPLACE(STR(?url), "^(?:.*?://)?(?:.*?@)?([^:]+?)(:\\d+)?((/.*)|$)", "$1") AS ?domain)

Note that queries with regular expressions can be quite slow.
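The last pattern can be sanity-checked outside the triple store before running it over a large dataset. The following is only a sketch in Python's `re` module, on the assumption that the store's regex engine treats these constructs the same way; `DOMAIN_PATTERN` and `domain_of` are names invented for this example.

```python
import re

# Same pattern as in the last REPLACE call. Inside the SPARQL string the
# backslash is doubled ("\\d"); in a Python raw string a single one is enough.
DOMAIN_PATTERN = re.compile(r"^(?:.*?://)?(?:.*?@)?([^:]+?)(:\d+)?((/.*)|$)")


def domain_of(url: str) -> str:
    # Python writes the first capture group as \1 where SPARQL writes $1.
    return DOMAIN_PATTERN.sub(r"\1", url, count=1)


samples = [
    "www.ex.com/data/1.html",
    "www.google.com/search",
    "http://user:password@example.org:8080/index.html",
    "ftp://example.org",
    "http://example.org/?q=a@b",  # "@" outside the authority
]

for url in samples:
    print(repr(url), "->", repr(domain_of(url)))
```

In this sketch the first four samples reduce to their host, but the last one comes back as "b": the optional auth-info group matches everything up to the "@" in the query string. That is one more case where a URI parsing library is safer than a regex.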

evsheino
  • Your 2nd regex works great! Thanks! The 1st one only returns http:// so I suspect it needs further tuning for my case, but the 2nd is good. I will do a performance test because, as you said, it might be very costly – user1156544 Oct 21 '16 at 20:08
  • There's no need for regex here. You can just use STRBEFORE and get the string before `/`. And the second regex doesn't work for non-HTTP(S) URLs, such as `ftp`, etc. This would also have issues with URLs that include authentication information and port information (e.g., `http://user:password@example.org:8080/index.html`, where you'd get `user:password@example.org:8080`). – Joshua Taylor Oct 21 '16 at 21:25
  • By the way, the 1st one should work for my question too. My mistake: I forgot that the real data has the scheme (http://...) – user1156544 Oct 21 '16 at 22:33