18

I am wondering if there is a parser or library in Java for extracting the second-level domain (SLD) from a URL, or failing that, an algorithm or regex for doing the same. For example:

URI uri = new URI("http://www.mydomain.ltd.uk/blah/some/page.html");

String host = uri.getHost();

System.out.println(host);

which prints:

www.mydomain.ltd.uk

Now what I'd like to do is robustly identify the SLD ("ltd.uk") component. Any ideas?

Edit: I'm ideally looking for a general solution, so I'd match ".uk" in "police.uk", ".co.uk" in "bbc.co.uk" and ".com" in "amazon.com".

Thanks

Jason S
Richard H

10 Answers

16

After reading everything here, the correct solution should be (with Guava):

InternetDomainName.from(uriHost).topPrivateDomain().toString();

See also the related question "errors when using Guava to get the private domain name".

Community
user85155
14

I don't know your purpose, but the second-level domain may not be what you actually need. You probably want to find the public suffix; the domain immediately below it is what you are looking for.

Apache HttpComponents (HttpClient 4) comes with classes to handle this:

org.apache.http.impl.cookie.PublicSuffixFilter
org.apache.http.impl.cookie.PublicSuffixListParser

You need to download the public suffix list from here:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
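To make the idea concrete, here is a minimal sketch in plain Java (no HttpClient needed). The class name RegistrableDomainSketch and the six sample suffixes are my own assumptions for illustration; a real program would load the full list, and this sketch ignores the wildcard (*.x) and exception (!x) rules that the real list contains.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: a tiny hand-picked suffix set stands in for the real public suffix list. */
final class RegistrableDomainSketch {

    // Assumption: sample entries only, not the real list.
    private static final Set<String> SUFFIXES = new HashSet<>(Arrays.asList(
            "com", "uk", "co.uk", "ltd.uk", "au", "com.au"));

    /** Returns the longest known public suffix of host, or null if none matches. */
    static String publicSuffix(String host) {
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            // Candidates shrink from the left, so the first hit is the longest suffix.
            if (SUFFIXES.contains(candidate)) {
                return candidate;
            }
            int dot = candidate.indexOf('.');
            if (dot == -1) {
                return null;
            }
            candidate = candidate.substring(dot + 1);
        }
    }

    /** Returns the public suffix plus one more label, or null if there is no such domain. */
    static String registrableDomain(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String suffix = publicSuffix(lower);
        if (suffix == null || suffix.length() == lower.length()) {
            return null; // unknown suffix, or the host is itself a public suffix
        }
        // Drop the suffix and the dot in front of it, then keep only the last remaining label.
        String prefix = lower.substring(0, lower.length() - suffix.length() - 1);
        String label = prefix.substring(prefix.lastIndexOf('.') + 1);
        return label + "." + suffix;
    }

    public static void main(String[] args) {
        System.out.println(publicSuffix("www.mydomain.ltd.uk"));      // ltd.uk
        System.out.println(registrableDomain("www.mydomain.ltd.uk")); // mydomain.ltd.uk
        System.out.println(registrableDomain("police.uk"));           // police.uk
        System.out.println(registrableDomain("www.amazon.com"));      // amazon.com
    }
}
```

The point of the sketch is only that "public suffix" and "domain right below it" fall out of a longest-match lookup against the list; everything else (loading, wildcards, exceptions) is what the parser classes above are for.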

David Moles
ZZ Coder
  • I found that list earlier but didn't bother posting it because it doesn't contain ltd.uk. Will this still do what you want? – danben Dec 17 '09 at 20:35
  • @danben: yeah I'll have to add the ltd.uk (which i was using only by way of example). Basically this solution comes down to using a list, and this seems reasonably comprehensive. I marked as correct this answer for including the parser. – Richard H Dec 17 '09 at 20:47
  • Yeah, I didn't know about public suffixes either. +1 from me. – danben Dec 18 '09 at 03:43
  • I suspect there is a bug in org.apache.http.impl.cookie.PublicSuffixFilter, or I have misunderstood it. See my answer below. – Iain Jan 28 '11 at 04:04
  • The URL to the suffix list appears to have changed subtly since this was originally answered. The list can now be found here: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1 – knightpfhor Mar 08 '11 at 04:12
12

After looking at these answers and not being satisfied by them I used the class com.google.common.net.InternetDomainName to subtract the public parts of a domain name from all the parts:

Set<String> nonePublicDomainParts(String uriHost) {
    InternetDomainName fullDomainName = InternetDomainName.from(uriHost);
    InternetDomainName publicDomainName = fullDomainName.publicSuffix();
    Set<String> nonePublicParts = new HashSet<String>(fullDomainName.parts());
    nonePublicParts.removeAll(publicDomainName.parts());
    return nonePublicParts;
}

That class is available from Maven in the Guava library:

    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>10.0.1</version>
        <scope>compile</scope>
    </dependency>

Internally, this class uses TldPatterns, which is package-private and has the list of top-level domains baked into it.

Interestingly, if you look at that class's source at the link below, it explicitly lists "police.uk" as a private domain name. This is correct, as police.uk is a private domain controlled by the police; otherwise criminals.police.uk would be emailing you asking for your credit card details in relation to their ongoing investigations into card fraud ;)

http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/net/TldPatterns.java?spec=svn8c3cc7e67132f8dcaae4bd214736a8ddf6611769&r=8c3cc7e67132f8dcaae4bd214736a8ddf6611769

evandrix
simbo1905
2

The selected answer is the best approach. For those of you that don't want to code it, here is how I have done it.

Firstly, either I don't understand org.apache.http.impl.cookie.PublicSuffixFilter, or there is a bug in it.

Basically if you pass in google.com it correctly returns false. If you pass in google.com.au it incorrectly returns true. The bug is in the code that applies patterns, e.g. *.au.

Here is the checker code based on org.apache.http.impl.cookie.PublicSuffixFilter:

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class TopLevelDomainChecker  {
    private Set<String> exceptions;
    private Set<String> suffixes;

    public void setPublicSuffixes(Collection<String> suffixes) {
        this.suffixes = new HashSet<String>(suffixes);
    }
    public void setExceptions(Collection<String> exceptions) {
        this.exceptions = new HashSet<String>(exceptions);
    }

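    /**
     * Checks whether the domain is a public suffix (an "effective TLD").
     * @param domain the domain name to test; a leading dot is ignored
     * @return true if the domain matches a suffix rule and no exception rule
     */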
    public boolean isTLD(String domain) {
        if (domain.startsWith(".")) 
            domain = domain.substring(1);

        // An exception rule takes priority over any other matching rule.
        // Exceptions are ones that are not a TLD, but would match a pattern rule
        // e.g. bl.uk is not a TLD, but the rule *.uk means it is. Hence there is an exception rule
        // stating that bl.uk is not a TLD. 
        if (this.exceptions != null && this.exceptions.contains(domain)) 
            return false;


        if (this.suffixes == null) 
            return false;

        if (this.suffixes.contains(domain)) 
            return true;

        // Try patterns. ie *.jp means that boo.jp is a TLD
        int nextdot = domain.indexOf('.');
        if (nextdot == -1)
            return false;
        domain = "*" + domain.substring(nextdot);
        if (this.suffixes.contains(domain)) 
            return true;

        return false;
    }


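    /**
     * Returns the public suffix plus one more label (e.g. "google.com.au"),
     * or "" if the input is itself a public suffix or no suffix rule matches.
     */
    public String extractSLD(String domain)
    {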
        String last = domain;
        boolean anySLD = false;
        do
        {
            if (isTLD(domain))
            {
                if (anySLD)
                    return last;
                else
                    return "";
            }
            anySLD = true;
            last = domain;
            int nextDot = domain.indexOf(".");
            if (nextDot == -1)
                return "";
            domain = domain.substring(nextDot+1);
        } while (domain.length() > 0);
        return "";
    }
}

And the parser. I renamed it.

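import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collection;

/**
 * Parses the list from <a href="http://publicsuffix.org/">publicsuffix.org</a>.
 * Copied from http://svn.apache.org/repos/asf/httpcomponents/httpclient/trunk/httpclient/src/main/java/org/apache/http/impl/cookie/PublicSuffixListParser.java
 */
public class TopLevelDomainParser {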
    private static final int MAX_LINE_LEN = 256;
    private final TopLevelDomainChecker filter;

    TopLevelDomainParser(TopLevelDomainChecker filter) {
        this.filter = filter;
    }
    public void parse(Reader list) throws IOException {
        Collection<String> rules = new ArrayList<String>();
        Collection<String> exceptions = new ArrayList<String>();
        BufferedReader r = new BufferedReader(list);
        StringBuilder sb = new StringBuilder(256);
        boolean more = true;
        while (more) {
            more = readLine(r, sb);
            String line = sb.toString();
            if (line.length() == 0) continue;
            if (line.startsWith("//")) continue; //entire lines can also be commented using //
            if (line.startsWith(".")) line = line.substring(1); // A leading dot is optional
            // An exclamation mark (!) at the start of a rule marks an exception to a previous wildcard rule
            boolean isException = line.startsWith("!"); 
            if (isException) line = line.substring(1);

            if (isException) {
                exceptions.add(line);
            } else {
                rules.add(line);
            }
        }

        filter.setPublicSuffixes(rules);
        filter.setExceptions(exceptions);
    }
    private boolean readLine(Reader r, StringBuilder sb) throws IOException {
        sb.setLength(0);
        int b;
        boolean hitWhitespace = false;
        while ((b = r.read()) != -1) {
            char c = (char) b;
            if (c == '\n') break;
            // Each line is only read up to the first whitespace
            if (Character.isWhitespace(c)) hitWhitespace = true;
            if (!hitWhitespace) sb.append(c);
            if (sb.length() > MAX_LINE_LEN) throw new IOException("Line too long"); // prevent excess memory usage
        }
        return (b != -1);
    }
}

And finally, how to use it

    FileReader fr = new FileReader("effective_tld_names.dat.txt");
    TopLevelDomainChecker checker = new TopLevelDomainChecker();
    TopLevelDomainParser parser = new TopLevelDomainParser(checker);
    parser.parse(fr);
    boolean result;
    result = checker.isTLD("com"); // true
    result = checker.isTLD("com.au"); // true
    result = checker.isTLD("ltd.uk"); // true
    result = checker.isTLD("google.com"); // false
    result = checker.isTLD("google.com.au"); // false
    result = checker.isTLD("metro.tokyo.jp"); // false
    String sld;
    sld = checker.extractSLD("com"); // ""
    sld = checker.extractSLD("com.au"); // ""
    sld = checker.extractSLD("google.com"); // "google.com"
    sld = checker.extractSLD("google.com.au"); // "google.com.au"
    sld = checker.extractSLD("www.google.com.au"); // "google.com.au"
    sld = checker.extractSLD("www.google.com"); // "google.com"
    sld = checker.extractSLD("foo.bar.hokkaido.jp"); // "foo.bar.hokkaido.jp"
    sld = checker.extractSLD("moo.foo.bar.hokkaido.jp"); // "foo.bar.hokkaido.jp"
Iain
1
public static String getTopLevelDomain(String uri) {
    // Despite the parameter name, InternetDomainName.from() expects a host name, not a full URI.
    InternetDomainName fullDomainName = InternetDomainName.from(uri);
    InternetDomainName publicDomainName = fullDomainName.topPrivateDomain();
    String topDomain = "";

    // Re-join the labels of the top private domain with dots.
    Iterator<String> it = publicDomainName.parts().iterator();
    while (it.hasNext()) {
        String part = it.next();
        if (!topDomain.isEmpty()) topDomain += ".";
        topDomain += part;
    }
    return topDomain;
}

Just pass in the host name and you will get back the top private domain (the public suffix plus one more label, e.g. "google.co.uk"). Download the Guava jar file from http://code.google.com/p/guava-libraries/

1
  1. The list mentioned above, combined with following the updates on Wikipedia, gives a TLD list that is about 98% correct.
  2. Going through http://www.iana.org/domains/root/db/ yourself, clicking each NIC and reading its latest news, gives you the other 2% (for example .com.aq and .gov.an).
  3. Unfortunately, large "free webspace" providers are another thing to take into account, e.g. the countless *.blogspot.com domains. If you download the Alexa top 100,000 (a free CSV file), you can at least get a good overview of the most-used ones, which should cover a certain percentage of these domains (e.g. when comparing Alexa ratings with StumbleUpon page views and Delicious bookmarks). Alexa sometimes counts only the top domain, while Delicious MD5-hashes every URL, so one Alexa entry can map to multiple Delicious MD5 hashes.
  4. Apart from that, in some cases such as Twitter, what comes after the / also matters if you need uniqueness in order to rate something.

Here is a list taken from the Alexa top 40,000 with the real TLDs filtered out, to give you a feeling for it (that is, Alexa does NOT fold the following entries into the rating of their parent domain):

bp.blogspot.com---espn.go.com---files.wordpress.com---abcnews.go.com---disney.go.com---troktiko.blogspot.com---en.wordpress.com---api.ning.com---abc.go.com---220.181.38.82---213.174.154.20---abclocal.go.com---feedproxy.google.com/~r---forums.wordpress.com---googleblog.blogspot.com---1.cnm999.com/user/10008---213.174.143.196---92.42.51.201---googlewebmastercentral.blogspot.com---myespn.go.com---213.174.143.197---61.132.221.146---support.wordpress.com---dashboard.wordpress.com---sethgodin.typepad.com---paygo.17zhifu.com/user/10005---go2.wordpress.com---1.1.1.1---movies.go.com---home.comcast.net---googlesystem.blogspot.com---abcfamily.go.com---home.spaces.live.com---196.1.237.210---kaixin001.com/~record---xhamster.com/user/video---gold-oil-commodity.blogspot.com---journeyplanner.tfl.gov.uk/user/XSLT_TRIP_REQUEST2---206.108.48.238---blog.wordpress.com---67.220.92.21---183.101.80.130---211.94.190.80---youtube-global.blogspot.com---uta-net.com/user/phplib---cinema3satu.blogspot.com---119.147.41.16---sites.google.com/site/sites---kk.iij4u.or.jp/~dyo---220.181.6.19---toontown.go.com---signup.wordpress.com---thesartorialist.blogspot.com---analytics.blogspot.com---ss.iij4u.or.jp/~ceh2---67.220.92.23---gmailblog.blogspot.com---183.99.121.86---vgorode.ru/user/create---61.132.216.243---217.175.53.72---labnol.blogspot.com---adsense.blogspot.com---subscribe.wordpress.com---fimotro.blogspot.com---creators.ning.com---sarkari-naukri.blogspot.com---search.wordpress.com---orange-hiyoko.blogspot.com---cashewmaniakpop.wordpress.com---pixiehollow.go.com---adwords.blogspot.com---202.53.226.102---lorelle.wordpress.com---homestead.com/~site---multiply.com/user/signout---221.231.148.249---183.101.80.77---windowsliveintro.spaces.live.com---124.228.254.234---streaming-web.blogspot.com---id.tianya.cn/user/message---familyfun.go.com---tro-ma-ktiko.blogspot.com---about.ning.com---paygo.17zhifu.com/user/10020---tututina.blogspot.com---toolserver.org/~geohack---superjob.ru/user/resume---ejobs.ro/use
r/locuri-de-munca---gnula.blogspot.com---alles.or.jp/~uir---chiark.greenend.org.uk/~sgtatham---woork.blogspot.com---88.208.32.218---webstreamingmania.blogspot.com---spaces.live.com---youtube.com/user/RayWilliamJohnson---cloob.com/user/login---asstr.org/~Kristen---getclicky.com/user/login---guesshermuff.blogspot.com---211.98.70.195---222.73.105.196---pp.iij4u.or.jp/~taakii---unsoloclic.blogspot.com---photoshopdisasters.blogspot.com---218.83.161.253---217.16.18.163---217.16.18.207---217.16.28.104---222.73.105.210---youtube.com/user/OldSpice---hubpages.com/user/new---pelisdvdripdd.blogspot.com---95.143.193.60---es.wordpress.com---217.16.18.206---61.147.116.146---damncoolpics.blogspot.com---family.go.com---81.176.235.162---gutteruncensorednewsr.blogspot.com---terselubung.blogspot.com---faisalardhy.blogspot.com---67.220.92.14---goodreads.com/user/show---116.228.55.34---profile.typepad.com---kaixin001.com/~truth---linkbuildersassociated.ning.com---nicotto.jp/user/mypage---ritemail.blogspot.com---hyperboleandahalf.blogspot.com---carscoop.blogspot.com---tubemogul.com/user/dash---press-gr.blogspot.com---81.176.235.164---soapnet.go.com---208.98.30.69---trelokouneli.blogspot.com---help.ning.com---id.tianya.cn/user/register---slovari.yandex.ru/~%D0%BA%D0%BD%D0%B8%D0%B3%D0%B8---printable-coupons.blogspot.com---unic77.blogspot.com---globaleconomicanalysis.blogspot.com---183.101.80.68---221.194.33.60---doujin-games88.blogspot.com---magaseek.com/user/SearchProducts---files.posterous.com---wwwnew.splinder.com---kolom-tutorial.blogspot.com---strobist.blogspot.com---67.21.91.73---needanarticle.com/user/activity---forum.moe.gov.om/~moeoman---milasdaydreams.blogspot.com---88.208.17.189---67.220.92.22---115.238.100.211---nonews-news.blogspot.com---testosterona.blog.br---nn.iij4u.or.jp/~has---cs.tut.fi/~jkorpela---youtube.com/user/oldspice---67.159.53.25---taxalia.blogspot.com---208.98.30.70---filmesporno.blog.br---alles-schallundrauch.blogspot.com---vatera.hu/user/account---78.140.136.18
2---us.my.alibaba.com/user/join---stores.homestead.com---pes2008editing.blogspot.com---ocn.ne.jp/~matrix---adweek.blogs.com---115.238.55.94---markjaquith.wordpress.com---k3.dion.ne.jp/~dreamlov---38.99.186.222---film.tv.it---android-developers.blogspot.com---217.218.110.147---kadokado.com/user/login---bollyvideolinks4u.blogspot.com---sookyeong.wordpress.com---87.101.230.11---livecodes.blogspot.com---67.220.91.19---homepage2.nifty.com/bustered---pp.iij4u.or.jp/~manga100---110.173.49.202---erogamescape.dyndns.org/~ap2---cs.berkeley.edu/~lorch---cakewrecks.blogspot.com---59.106.117.185---119.75.213.61---id.wordpress.com---de.wordpress.com---telefilmdblink.blogspot.com---61.139.105.138---multiply.com/user/join---programseo.blogspot.com---collectivebias.ning.com---bablorub.blogspot.com---thinkexist.com/user/personalAccount---us.my.alibaba.com/user/sign---66.70.56.90---getsarkari-naukri.blogspot.com---59.106.117.183---productreviewplace.ning.com---support.weebly.com---kaixin001.com/~lucky---football-russia.blogspot.com---magaseek.com/user/ItemDetail---polprav.blogspot.com---atlasshrugs2000.typepad.com---jpn-manga.blogspot.com---88.208.32.219---google-latlong.blogspot.com---59.106.117.188---erogamescape.ddo.jp/~ap2---218.87.32.245---watchhorrormovies.blogspot.com---sarotiko.blogspot.com---googlewebmastercentral-de.blogspot.com---colmeia.blog.br---us.my.alibaba.com/user/webatm---220.170.79.109---darkville.blogspot.com---youtube.com/user/PiMPDailyDose---disneymovierewards.go.com---fukuoka.lg.jp---61.147.115.16---iisc.ernet.in---youtube.com/user/HuskyStarcraft---202.108.212.211---homepage3.nifty.com/otakarando---94.77.215.37---pitchit.ning.com---59.106.117.186---thestar.blogs.com---1.254.254.254---piratesonline.go.com---animedblink.blogspot.com---137.32.44.152---eurus.dti.ne.jp/~yoneyama---state.la.us---lastminute.is.it---bangpai.taobao.com/user/groups---csse.monash.edu.au/~jwb---jquery-howto.blogspot.com---sakura.ne.jp/~moesino---users.skynet.be/mgueury---saitama.lg.jp---por
taldasfinancas.gov.pt---bnonline.fi.cr---135.125.60.11---zhuhai.gd.cn---kuna.net.kw---59.175.213.77---58.218.199.7---multiply.com/user/signin---youtube.com/user/HDstarcraft---blinklist.com/user/join---us.my.alibaba.com/user/company---jptwitterhelp.blogspot.com---67.220.92.017---88.208.17.51---youtube.com/user/GoogleWebmasterHelp---208.53.156.229---filmdblink.blogspot.com---blinklist.com/user/signup---3arbtop.blogspot.com---attivissimo.blogspot.com---onlinemovie12.blogspot.com---98.126.189.86---mytvsource.blogspot.com---blinklist.com/user/login---googlejapan.blogspot.com---76.73.65.166---gutteruncensorednewsb.blogspot.com---issuu.com/user/upload---86.51.174.18---88.208.17.120---profile.china.alibaba.com/user/admin---jntuworldportal.blogspot.com---sz.js.cn---disneymovieclub.go.com---a1.com.mk---dd.iij4u.or.jp/~madonna---rr.iij4u.or.jp/~plasma---mlmlaunchformula.ning.com---112.78.7.151---blogdelatele.blogspot.com---googlemobile.blogspot.com---78.109.199.240---wsu.edu/~brians---internapoli-city.blogspot.com---hh.iij4u.or.jp/~dmt---kaixin001.com/~house---61.155.11.14---youtube.com/user/SHAYTARDS---turbobit.net/user/files---qjy168.com/user/do---hubpages.com/user/finished---upload2.dyndns.org---f32.aaa.livedoor.jp/~azusa---naruto-spoilers.blogspot.com---205.209.140.195---193.227.20.21---adsenseforfeeds.blogspot.com---group.ameba.jp/user/groups---

edelwater
0

I don't have an answer for your specific case — and Jonathan's comment points out that you should probably refactor your question.

Still, I suggest taking a look at the Reference class of the Restlet project. It has a ton of useful methods. And since Restlet is Open Source, you wouldn't have to use the entire library — you could download the source and add just that one class to your project.

Avi Flax
0

Dnspy is another more flexible alternative to the publicsuffix lib.

sandyp
0

1. The method nonePublicDomainParts from simbo1905's answer needs correcting, because a TLD can itself contain a ".", for example "com.ac":

input: "com.abc.com.ac"

output: "abc"

correct output is "com.abc".

To get the SLD, you can cut the TLD off the given domain using the publicSuffix() method.

2. A Set should not be used, because a domain can contain the same part more than once, for example:

input: part1.part2.part1.TLD

output: part1, part2

correct output is: part1, part2, part1 or in the form part1.part2.part1

So instead of Set<String> use List<String>.
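Both corrections can be sketched in plain Java without Guava. The class and method names below (NonPublicPartsSketch, nonPublicParts) are hypothetical, and the sketch assumes the caller already knows the public suffix, for example from publicSuffix(); it simply drops that many labels from the end and keeps the rest in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NonPublicPartsSketch {

    /**
     * Returns the labels of host that precede publicSuffix, in order, duplicates kept.
     * Assumption: publicSuffix was determined elsewhere (e.g. by a public suffix lookup).
     */
    static List<String> nonPublicParts(String host, String publicSuffix) {
        if (!host.equals(publicSuffix) && !host.endsWith("." + publicSuffix)) {
            throw new IllegalArgumentException(publicSuffix + " is not a suffix of " + host);
        }
        List<String> hostParts = Arrays.asList(host.split("\\."));
        int suffixSize = publicSuffix.split("\\.").length;
        // Cut by position instead of removeAll(), so equal labels further left survive.
        return new ArrayList<>(hostParts.subList(0, hostParts.size() - suffixSize));
    }

    public static void main(String[] args) {
        System.out.println(nonPublicParts("com.abc.com.ac", "com.ac"));     // [com, abc]
        System.out.println(nonPublicParts("part1.part2.part1.tld", "tld")); // [part1, part2, part1]
    }
}
```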

amesh
-2

If you want the second-level domain, you can split the string on "." and take the last two parts. Of course, this assumes you always have a second-level domain that is not specific to the site (since it sounds like that's what you want).
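A minimal sketch of that split, with a hypothetical class name (LastTwoLabelsSketch), also shows where the assumption breaks down:

```java
final class LastTwoLabelsSketch {

    /** Returns the last two dot-separated labels of host, or host itself if it has fewer. */
    static String lastTwoLabels(String host) {
        String[] parts = host.split("\\.");
        if (parts.length < 2) {
            return host;
        }
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastTwoLabels("www.amazon.com")); // amazon.com
        // The assumption fails for multi-label suffixes: the site name is lost here.
        System.out.println(lastTwoLabels("www.bbc.co.uk"));  // co.uk
    }
}
```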

danben
  • sadly, this is a general problem - so I need to match .com, .co.uk, .uk (e.g as in police.uk) etc – Richard H Dec 17 '09 at 19:05
  • In that case, you could try to build a hash from this list - https://wiki.mozilla.org/TLD_List and then check the last two parts for a second level domain match, otherwise take the last part as the TLD. – danben Dec 17 '09 at 19:09
  • yes i don't think there's a programmatic solution to this beyond building a big list to match against – Richard H Dec 17 '09 at 19:16
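The hash lookup suggested in the comments could look roughly like the sketch below. The class name TwoPartSuffixLookupSketch and the three sample entries are assumptions for illustration only; in practice the set would be built from the full list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TwoPartSuffixLookupSketch {

    // Assumption: sample entries only; a real set would be loaded from the list.
    private static final Set<String> TWO_PART_SUFFIXES = new HashSet<>(Arrays.asList(
            "co.uk", "ltd.uk", "com.au"));

    /** Returns the last two labels if they are a known suffix, otherwise the last label. */
    static String suffixOf(String host) {
        String[] parts = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = parts.length;
        if (n >= 2) {
            String lastTwo = parts[n - 2] + "." + parts[n - 1];
            if (TWO_PART_SUFFIXES.contains(lastTwo)) {
                return lastTwo;
            }
        }
        return parts[n - 1];
    }

    public static void main(String[] args) {
        System.out.println(suffixOf("bbc.co.uk"));  // co.uk
        System.out.println(suffixOf("police.uk"));  // uk
        System.out.println(suffixOf("amazon.com")); // com
    }
}
```

This only handles suffixes of one or two labels, so it is a simplification of what the public suffix list approach in the other answers covers.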