5

I need to extract the top domain from a URL, and I found this: http://publicsuffix.org/index.html

The Java implementation is in http://guava-libraries.googlecode.com, but I could not find any example of extracting the domain name.

For example,
example.google.com
returns google.com

and bing.bing.bing.com
returns bing.com

Can anyone tell me how to implement this using this library, with an example?

ColinD
ramuvan
  • So, you're looking to extract [TLD](http://en.wikipedia.org/wiki/Top-level_domain) (the `.com` part) and [SLD](http://en.wikipedia.org/wiki/Second-level_domain) (the `google` or `bing` part) from URLs? – Matt Ball Jan 27 '11 at 17:36
  • If you just want the last two parts of the domain, couldn't you just `String.split('\\.')` to get the parts and return the last two? Or do a `String.substring(indexOfPenultimatePeriod)` after (easily) working out the appropriate index? What is the complexity here? – Andrzej Doyle Jan 27 '11 at 17:37
  • @Andrzej Doyle Yes, you are right, and this is a list of 10k URLs with different suffixes, such as .com, .com.jp, .org, .com.in, etc. – ramuvan Jan 27 '11 at 17:40
  • @ramuvan - good point, you should add those cases to the examples. The only way to cope with this is to have a list of definitive TLDs, and match the end of your domain string against them. – Andrzej Doyle Jan 27 '11 at 18:06
  • @ramuvan: Guava does have a solution that makes this easy... see my answer. – ColinD Jan 27 '11 at 19:13

3 Answers

18

It looks to me like `InternetDomainName.topPrivateDomain()` does exactly what you want. Guava maintains a list of public suffixes (based on Mozilla's list at publicsuffix.org) that it uses to determine which part of the host is the public suffix. The top private domain is the public suffix plus its first child.

Here's a quick example:

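import com.google.common.collect.ImmutableList;
import com.google.common.net.InternetDomainName;
import java.net.URI;
import java.net.URISyntaxException;

public class Test {
  public static void main(String[] args) throws URISyntaxException {
    ImmutableList<String> urls = ImmutableList.of(
        "http://example.google.com", "http://google.com",
        "http://bing.bing.bing.com", "http://www.amazon.co.jp/");
    for (String url : urls) {
      System.out.println(url + " -> " + getTopPrivateDomain(url));
    }
  }

  private static String getTopPrivateDomain(String url) throws URISyntaxException {
    // Let java.net.URI parse out the host instead of splitting the string by hand.
    String host = new URI(url).getHost();
    InternetDomainName domainName = InternetDomainName.from(host);
    // Older Guava releases exposed name(); it was later removed, and
    // toString() returns the domain name as a string.
    return domainName.topPrivateDomain().toString();
  }
}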

Running this code prints:

http://example.google.com -> google.com
http://google.com -> google.com
http://bing.bing.bing.com -> bing.com
http://www.amazon.co.jp/ -> amazon.co.jp
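
To make the "public suffix plus its first child" rule concrete, here is a minimal, dependency-free sketch of the idea. It is not Guava's implementation: the class name `TopPrivateDomainSketch`, the method, and the five-entry suffix set are all assumptions made up for illustration, and the real list at publicsuffix.org is far larger and also has wildcard and exception rules that this ignores.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TopPrivateDomainSketch {
  // Hypothetical, tiny subset of suffixes; NOT the real public suffix list.
  private static final Set<String> SUFFIXES =
      new HashSet<>(Arrays.asList("com", "org", "jp", "co.jp", "blogspot.com"));

  /**
   * Returns the longest matching suffix plus the label to its left,
   * or null if no suffix matches or the host is itself a suffix.
   */
  static String topPrivateDomain(String host) {
    String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
    // Try candidate suffixes from longest (the whole host) to shortest (last label).
    for (int i = 0; i < labels.length; i++) {
      String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
      if (SUFFIXES.contains(candidate)) {
        if (i == 0) {
          return null; // the host is itself a suffix, so it has no "first child"
        }
        return labels[i - 1] + "." + candidate;
      }
    }
    return null; // no known suffix matched
  }

  public static void main(String[] args) {
    String[] hosts = {
        "example.google.com", "bing.bing.bing.com", "www.amazon.co.jp",
        "myblog.blogspot.com", "blogspot.com"};
    for (String host : hosts) {
      System.out.println(host + " -> " + topPrivateDomain(host));
    }
  }
}
```

Because the longest suffix wins, `www.amazon.co.jp` maps to `amazon.co.jp` rather than `co.jp`, and `myblog.blogspot.com` maps to itself. A host that is itself on the list, such as `blogspot.com` here, has no top private domain, so this sketch returns null for it.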
evandrix
ColinD
  • TLD and Public Suffix are not the same. For example `http://myblog.blogspot.com -> myblog.blogspot.com`. Read [this](https://code.google.com/p/guava-libraries/wiki/InternetDomainNameExplained) for further details – gamliela Mar 03 '14 at 17:47
  • Do you know why `s3.amazonaws.com` returns a null? – chrisTina Nov 14 '14 at 16:15
  • @Liquid: `s3.amazonaws.com` is itself a public suffix: https://publicsuffix.org/list/effective_tld_names.dat – ColinD Nov 14 '14 at 17:11
  • Sorry, it works well... I had implemented it the wrong way. – chrisTina Nov 14 '14 at 17:40
  • `[javac] symbol : method name() [javac] location: class com.google.common.net.InternetDomainName [javac] this.domain = domainName.topPrivateDomain().name(); [javac] ^ [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 1 error ` What does this error mean? why `.name()` method?? – chrisTina Nov 14 '14 at 19:29
2

I recently implemented a Public Suffix List API:

PublicSuffixList suffixList = new PublicSuffixListFactory().build();

assertEquals(
    "google.com", suffixList.getRegistrableDomain("example.google.com"));

assertEquals(
    "bing.com", suffixList.getRegistrableDomain("bing.bing.bing.com"));

assertEquals(
    "amazon.co.jp", suffixList.getRegistrableDomain("www.amazon.co.jp"));
Markus Malkusch
1

EDIT: Sorry, I was a little too fast. I didn't think of co.jp, co.uk, and so on. You will need to get a list of possible TLDs from somewhere. You could also take a look at http://commons.apache.org/validator/ to validate a TLD.

I think something like this should work, but maybe a standard Java function for this already exists.

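String url = "http://www.foobar.com/someFolder/index.html";
// Strip the scheme, if any.
if (url.contains("://")) {
  url = url.split("://")[1];
}

// Strip the path, if any.
if (url.contains("/")) {
  url = url.split("/")[0];
}

// You need to get your TLDs from somewhere...
// (assumed to be entries without a leading dot, e.g. "com" or "co.jp")
List<String> magicListofTLD = getTLDsFromSomewhere();

// Pick the longest TLD the host ends with, so that "co.jp" wins over "jp".
// Matching with endsWith avoids hits in the middle of the host name.
String usedTLD = null;
for (String tld : magicListofTLD) {
  if (url.endsWith("." + tld)
      && (usedTLD == null || tld.length() > usedTLD.length())) {
    usedTLD = tld;
  }
}

if (usedTLD == null) {
  return;
}
// Drop the "." + TLD suffix, leaving e.g. "www.foobar".
url = url.substring(0, url.length() - usedTLD.length() - 1);
String[] strings = url.split("\\.");

String foo = strings[strings.length - 1] + "." + usedTLD;
System.out.println(foo);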
Andreas L.
  • yeah, sorry, didn't think of co.jp, co.uk and so on. I guess you have to get a list of possible TLDs and try to match them with the String. – Andreas L. Jan 27 '11 at 18:00
  • Guava has built-in functionality for doing this, including an internal TLD list that will be updated with new releases as the TLD list changes. On top of that, Java has built-in functionality for parsing and getting the host part of a URL... I don't think parsing it out manually with `split` is a good idea. – ColinD Jan 27 '11 at 19:16
  • @ColinD: Nice library. Didn't know of it. – Andreas L. Jan 27 '11 at 19:27
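
As a rough illustration of the built-in host parsing mentioned in the comments, here is a small sketch that uses only `java.net.URI`. The class name `HostExtractor`, the method `hostOf`, and the choice to prepend `http://` when no scheme is present are assumptions made for this example, not part of any answer above.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HostExtractor {
  /** Returns the host of a URL-like string, or null if it cannot be parsed. */
  static String hostOf(String url) {
    // Without a scheme, URI treats "www.foo.com/bar" as a path and getHost()
    // returns null, so prepend one (an assumption for scheme-less input).
    String candidate = url.contains("://") ? url : "http://" + url;
    try {
      return new URI(candidate).getHost();
    } catch (URISyntaxException e) {
      return null; // malformed input, e.g. one containing spaces
    }
  }

  public static void main(String[] args) {
    System.out.println(hostOf("http://www.foobar.com/someFolder/index.html"));
    System.out.println(hostOf("www.amazon.co.jp/path?x=1"));
  }
}
```

The returned host can then be passed to whichever suffix lookup you use, which avoids the manual `split` on `://` and `/`.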