A normal URL has a protocol, optional subdomains, a domain name, a top-level domain (TLD) and subdirectories.

For example: http://www.google.com/path. Here www is the subdomain, google is the domain name, com is the TLD, and path is the subdirectory. Parsing this is a simple programming task.
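As a minimal sketch of that simple case, the split can be done with Python's standard `urllib.parse`. The function name `split_simple_url` and the returned keys are illustrative, and the code assumes the host ends in a single-label TLD:

```python
from urllib.parse import urlsplit

def split_simple_url(url):
    """Split a URL whose host ends in a single-label TLD (assumption)."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "protocol": parts.scheme,
        "subdomains": labels[:-2],   # everything before the domain name
        "domain": labels[-2],        # label just before the TLD
        "tld": labels[-1],           # last label
        "path": parts.path,
    }

print(split_simple_url("http://www.google.com/path"))
# {'protocol': 'http', 'subdomains': ['www'], 'domain': 'google', 'tld': 'com', 'path': '/path'}
```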

But the problem comes when there is more than one TLD. For example: www.google.co.in/path. Here co.in is the TLD. But I see that a website named www.co.in also exists.

My questions are:

  • How many top-level domains can a URL have? If there can be multiple TLDs, how do I find them in a URL?
  • In the above example, google.co.in is not a subdomain of co.in, so how come www.co.in resolves to a different website than google.co.in?
unor
kumar
  • Actually, only the last part of the domain name is the TLD, always. Some countries do enforce secondary top-level types (like .co.uk), but it's always the last part that is the TLD (.uk in my previous example). – Some programmer dude Jun 27 '14 at 09:03
  • What are the criteria for secondary TLDs? For example, I want to parse "www.google.com" to "google.com", but "code.google.co.uk" to "google.co.uk". Are second-level domains only allowed under country codes? – kumar Jun 27 '14 at 09:11
  • The criteria are whatever the TLD registrar wants them to be, and may not even be fully consistent. For example in the UK, most businesses are under `.co.uk`, but parliament is `www.parliament.uk` (not `.gov.uk`, as a matter of constitutional principle), and http://parliament.uk works, so there isn't necessarily a `www` part. The best you'll do is a country-by-country heuristic, I think. – Norman Gray Jun 27 '14 at 10:15

2 Answers

If I had to write an algorithm that decides that "www.co.in" belongs to the Indian top-level domain (TLD) and "www.google.co.in" belongs to an Indian second-level domain (SLD), I would go here and grab the list:

https://wiki.mozilla.org/TLD_List

Then, I would process my URL like this:

  1. Compare the last part of the URL to all TLDs in the list and find a matching one. [www.google.co.in -> in, www.co.in -> in]
  2. If no TLD was found, the URL is invalid.
  3. If a TLD was found and the URL has three parts or fewer, return the TLD as result and exit.
  4. If a TLD was found and the URL has more than three parts, do a second search in the list of SLDs. Compare the end of the URL against the pattern ".SLD.TLD".
  5. If no entry was found, return the TLD as result and exit.
  6. If an entry was found, return SLD.TLD as result and exit.
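The steps above can be sketched in Python roughly as follows. The sets `TLDS` and `SLDS` are tiny hypothetical excerpts standing in for the full lists, and `find_suffix` is an illustrative name, not part of any library:

```python
# Hypothetical excerpts; a real program would load the complete lists.
TLDS = {"com", "in", "uk"}
SLDS = {"co.in", "net.in", "co.uk"}

def find_suffix(host):
    """Return the TLD or SLD.TLD suffix of host, following the steps above."""
    labels = host.lower().split(".")
    tld = labels[-1]                       # step 1: last part of the name
    if tld not in TLDS:                    # step 2: unknown TLD, invalid
        raise ValueError(f"unknown TLD in {host!r}")
    if len(labels) <= 3:                   # step 3: three parts or fewer
        return tld
    candidate = ".".join(labels[-2:])      # step 4: look for SLD.TLD
    if candidate in SLDS:                  # step 6: SLD entry found
        return candidate
    return tld                             # step 5: no SLD entry

print(find_suffix("www.google.co.in"))   # co.in
print(find_suffix("www.co.in"))          # in
print(find_suffix("www.google.com"))     # com
```

Note that, with step 3 as written, a three-part name such as google.co.in also returns "in", because the SLD lookup only runs for longer names.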
peter_the_oak

A very slow yet comprehensive regex you could use (sourced from Wikipedia and Mozilla):

[a-z0-9-]{1,63}(.ab.ca|.bc.ca|.mb.ca|.nb.ca|.nf.ca|.nl.ca|.ns.ca|.nt.ca|.nu.ca|.on.ca|.pe.ca|.qc.ca|.sk.ca|.yk.ca|.co.cc|.com.cd|.net.cd|.org.cd|.co.ck|.ac.cn|.com.cn|.edu.cn|.gov.cn|.net.cn|.org.cn|.ah.cn|.bj.cn|.cq.cn|.fj.cn|.gd.cn|.gs.cn|.gz.cn|.gx.cn|.ha.cn|.hb.cn|.he.cn|.hi.cn|.hl.cn|.hn.cn|.jl.cn|.js.cn|.jx.cn|.ln.cn|.nm.cn|.nx.cn|.qh.cn|.sc.cn|.sd.cn|.sh.cn|.sn.cn|.sx.cn|.tj.cn|.xj.cn|.xz.cn|.yn.cn|.zj.cn|.us.com|.com.cu|.edu.cu|.org.cu|.net.cu|.gov.cu|.inf.cu|.gov.cx|.com.dz|.org.dz|.net.dz|.gov.dz|.edu.dz|.asso.dz|.pol.dz|.art.dz|.com.ec|.info.ec|.net.ec|.fin.ec|.med.ec|.pro.ec|.org.ec|.edu.ec|.gov.ec|.mil.ec|.com.ee|.org.ee|.fie.ee|.pri.ee|.com.es|.nom.es|.org.es|.gob.es|.edu.es|.aland.fi|.tm.fr|.asso.fr|.nom.fr|.prd.fr|.presse.fr|.com.fr|.gouv.fr|.com.ge|.edu.ge|.gov.ge|.org.ge|.mil.ge|.net.ge|.pvt.ge|.co.gg|.net.gg|.org.gg|.com.gi|.ltd.gi|.gov.gi|.mod.gi|.edu.gi|.org.gi|.com.gp|.net.gp|.edu.gp|.asso.gp|.org.gp|.com.gr|.edu.gr|.net.gr|.org.gr|.gov.gr|.com.hk|.edu.hk|.gov.hk|.idv.hk|.net.hk|.org.hk|.com.hn|.edu.hn|.org.hn|.net.hn|.mil.hn|.gob.hn|.iz.hr|.from.hr|.name.hr|.com.hr|.com.ht|.net.ht|.firm.ht|.shop.ht|.info.ht|.pro.ht|.adult.ht|.org.ht|.art.ht|.pol.ht|.rel.ht|.asso.ht|.perso.ht|.coop.ht|.med.ht|.edu.ht|.gouv.ht|.gov.ie|.co.in|.firm.in|.net.in|.org.in|.gen.in|.ind.in|.nic.in|.ac.in|.edu.in|.res.in|.gov.in|.mil.in|.ac.ir|.co.ir|.gov.ir|.net.ir|.org.ir|.sch.ir|.co.je|.net.je|.org.je|.com.jo|.org.jo|.net.jo|.edu.jo|.gov.jo|.mil.jo|.co.kr|.or.kr|.edu.ky|.gov.ky|.com.ky|.org.ky|.net.ky|.gov.lk|.sch.lk|.net.lk|.int.lk|.com.lk|.org.lk|.edu.lk|.ngo.lk|.soc.lk|.web.lk|.ltd.lk|.assn.lk|.grp.lk|.hotel.lk|.gov.lt|.mil.lt|.gov.lu|.mil.lu|.org.lu|.net.lu|.com.lv|.edu.lv|.gov.lv|.org.lv|.mil.lv|.id.lv|.net.lv|.asn.lv|.conf.lv|.com.ly|.net.ly|.gov.ly|.plc.ly|.edu.ly|.sch.ly|.med.ly|.org.ly|.id.ly|.co.ma|.net.ma|.gov.ma|.org.ma|.tm.mc|.asso.mc|.org.mg|.nom.mg|.gov.mg|.prd.mg|.tm.mg|.com.mg|.edu.mg|.mil.mg|.com.mk|.org.mk|.com.mo|.net.mo|.org.mo|.edu.mo|.gov.mo|.or
g.mt|.com.mt|.gov.mt|.edu.mt|.net.mt|.com.mu|.co.mu|.gov.nr|.edu.nr|.biz.nr|.info.nr|.com.nr|.net.nr|.com.pf|.org.pf|.edu.pf|.com.ph|.gov.ph|.com.pk|.net.pk|.edu.pk|.org.pk|.fam.pk|.biz.pk|.web.pk|.gov.pk|.gob.pk|.gok.pk|.gon.pk|.gop.pk|.gos.pk|.com.pl|.biz.pl|.net.pl|.art.pl|.edu.pl|.org.pl|.ngo.pl|.gov.pl|.info.pl|.mil.pl|.waw.pl|.warszawa.pl|.wroc.pl|.wroclaw.pl|.krakow.pl|.poznan.pl|.lodz.pl|.gda.pl|.gdansk.pl|.slupsk.pl|.szczecin.pl|.lublin.pl|.bialystok.pl|.olsztyn.pl.torun.pl|.biz.pr|.com.pr|.edu.pr|.gov.pr|.info.pr|.isla.pr|.name.pr|.net.pr|.org.pr|.pro.pr|.edu.ps|.gov.ps|.sec.ps|.plo.ps|.com.ps|.org.ps|.net.ps|.com.pt|.edu.pt|.gov.pt|.int.pt|.net.pt|.nome.pt|.org.pt|.publ.pt|.com.ro|.org.ro|.tm.ro|.nt.ro|.nom.ro|.info.ro|.rec.ro|.arts.ro|.firm.ro|.store.ro|.www.ro|.com.ru|.net.ru|.org.ru|.pp.ru|.msk.ru|.int.ru|.ac.ru|.gov.rw|.net.rw|.edu.rw|.ac.rw|.com.rw|.co.rw|.int.rw|.mil.rw|.gouv.rw|.com.sc|.gov.sc|.net.sc|.org.sc|.edu.sc|.com.sd|.net.sd|.org.sd|.edu.sd|.med.sd|.tv.sd|.gov.sd|.info.sd|.org.se|.pp.se|.tm.se|.brand.se|.parti.se|.press.se|.komforb.se|.kommunalforbund.se|.komvux.se|.lanarb.se|.lanbib.se|.naturbruksgymn.se|.sshn.se|.fhv.se|.fhsk.se|.fh.se|.mil.se|.ab.se|.c.se|.d.se|.e.se|.f.se|.g.se|.h.se|.i.se|.k.se|.m.se|.n.se|.o.se|.s.se|.t.se|.u.se|.w.se|.x.se|.y.se|.z.se|.ac.se|.bd.se|.com.sg|.net.sg|.org.sg|.gov.sg|.edu.sg|.per.sg|.idn.sg|.ac.tj|.biz.tj|.com.tj|.co.tj|.edu.tj|.int.tj|.name.tj|.net.tj|.org.tj|.web.tj|.gov.tj|.go.tj|.mil.tj|.gov.to|.gov.tp|.co.tt|.com.tt|.org.tt|.net.tt|.biz.tt|.info.tt|.pro.tt|.name.tt|.edu.tt|.gov.tt|.gov.tv|.edu.tw|.gov.tw|.mil.tw|.com.tw|.net.tw|.org.tw|.idv.tw|.game.tw|.ebiz.tw|.club.tw|.com.ua|.gov.ua|.net.ua|.edu.ua|.org.ua|.cherkassy.ua|.ck.ua|.chernigov.ua|.cn.ua|.chernovtsy.ua|.cv.ua|.crimea.ua|.dnepropetrovsk.ua|.dp.ua|.donetsk.ua|.dn.ua|.ivano-frankivsk.ua|.if.ua|.kharkov.ua|.kh.ua|.kherson.ua|.ks.ua|.khmelnitskiy.ua|.km.ua|.kiev.ua|.kv.ua|.kirovograd.ua|.kr.ua|.lugansk.ua|.lg.ua|.lutsk.ua|.lviv.ua|.nikolaev.
ua|.mk.ua|.odessa.ua|.od.ua|.poltava.ua|.pl.ua|.rovno.ua|.rv.ua|.sebastopol.ua|.sumy.ua|.ternopil.ua|.te.ua|.uzhgorod.ua|.vinnica.ua|.vn.ua|.zaporizhzhe.ua|.zp.ua|.zhitomir.ua|.zt.ua|.co.ug|.ac.ug|.sc.ug|.go.ug|.ne.ug|.or.ug|.ak.us|.al.us|.ar.us|.az.us|.ca.us|.co.us|.ct.us|.dc.us|.de.us|.dni.us|.fed.us|.fl.us|.ga.us|.hi.us|.ia.us|.id.us|.il.us|.in.us|.isa.us|.kids.us|.ks.us|.ky.us|.la.us|.ma.us|.md.us|.me.us|.mi.us|.mn.us|.mo.us|.ms.us|.mt.us|.nc.us|.nd.us|.ne.us|.nh.us|.nj.us|.nm.us|.nsn.us|.nv.us|.ny.us|.oh.us|.ok.us|.or.us|.pa.us|.ri.us|.sc.us|.sd.us|.tn.us|.tx.us|.ut.us|.vt.us|.va.us|.wa.us|.wi.us|.wv.us|.wy.us|.com.vi|.org.vi|.edu.vi|.gov.vi|.com.vn|.net.vn|.org.vn|.edu.vn|.gov.vn|.int.vn|.ac.vn|.biz.vn|.info.vn|.name.vn|.pro.vn|.health.vn|.com|.org|.net|.int|.edu|.gov|.mil|.arpa|.ac|.ad|.ae|.af|.ag|.ai|.al|.am|.an|.ao|.aq|.ar|.as|.at|.au|.aw|.ax|.az|.ba|.bb|.bd|.be|.bf|.bg|.bh|.bi|.bj|.bm|.bn|.bo|.br|.bs|.bt|.bw|.by|.bz|.ca|.cc|.cd|.cf|.cg|.ch|.ci|.ck|.cl|.cm|.cn|.co|.cr|.cu|.cv|.cw|.cx|.cy|.cz|.de|.dj|.dk|.dm|.do|.dz|.ec|.ee|.eg|.es|.et|.eu|.fi|.fj|.fk|.fm|.fo|.fr|.ga|.gd|.ge|.gf|.gg|.gh|.gi|.gl|.gm|.gn|.gp|.gq|.gr|.gs|.gt|.gu|.gw|.gy|.hk|.hm|.hn|.hr|.ht|.hu|.id|.ie|.il|.im|.in|.io|.iq|.ir|.is|.it|.je|.jm|.jo|.jp|.ke|.kg|.kh|.ki|.km|.kn|.kp|.kr|.kw|.ky|.kz|.la|.lb|.lc|.li|.lk|.lr|.ls|.lt|.lu|.lv|.ly|.ma|.mc|.md|.me|.mg|.mh|.mk|.ml|.mm|.mn|.mo|.mp|.mq|.mr|.ms|.mt|.mu|.mv|.mw|.mx|.my|.mz|.na|.nc|.ne|.nf|.ng|.ni|.nl|.no|.np|.nr|.nu|.nz|.om|.pa|.pe|.pf|.pg|.ph|.pk|.pl|.pm|.pn|.pr|.ps|.pt|.pw|.py|.qa|.re|.ro|.rs|.ru|.rw|.sa|.sb|.sc|.sd|.se|.sg|.sh|.si|.sk|.sl|.sm|.sn|.so|.sr|.ss|.st|.su|.sv|.sx|.sy|.sz|.tc|.td|.tf|.tg|.th|.tj|.tk|.tl|.tm|.tn|.to|.tr|.tt|.tv|.tw|.tz|.ua|.ug|.us|.uy|.uz|.va|.vc|.ve|.vg|.vi|.vn|.vu|.wf|.ws|.ye|.yt|.za|.zm|.zw|.dz|.am|.bh|.bd|.by|.bg|.cn|.cn|.eg|.eu|.ge|.gr|.hk|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.ir|.iq|.jo|.kz|.mo|.mo|.my|.mr|.mn|.ma|.mk|.om|.pk|.ps|.qa|.ru|.sa|.rs|.sg|.sg|.kr|.lk|.lk|.sd|.sy|.tw|.tw|.th|.tn|
.ua|.ae|.ye|.academy|.accountant|.adult|.aero|.africa|.agency|.apartments|.app|.archi|.associates|.audio|.auto|.bar|.bargains|.bible|.bike|.biz|.black|.blackfriday|.blog|.blue|.builders|.cam|.cam|.camera|.camp|.cancerresearch|.car|.cards|.cars|.center|.cheap|.christmas|.church|.click|.clothing|.cloud|.club|.codes|.coffee|.college|.coop|.country|.dance|.date|.dating|.design|.dev|.diet|.directory|.download|.eco|.education|.email|.events|.exchange|.exposed|.faith|.farm|.flowers|.game|.gdn|.gift|.glass|.global|.gop|.green|.guitars|.guru|.help|.hiphop|.hiv|.holdings|.hosting|.house|.info|.ink|.international|.jobs|.kim|.land|.lgbt|.life|.lighting|.link|.live|.loan|.lol|.love|.map|.market|.med|.meet|.menu|.mobi|.moe|.mom|.movie|.museum|.music|.name|.new|.NGO_and_.ONG|.org_(top-level_domain)|.one|.one|.onl|.ooo|.organic|.pharmacy|.photo|.photos|.pics|.pink|.pizza|.plumbing|.porn|.post|.pro|.properties|.property|.realtor|.rich|.rocks|.sale|.science|.sex|.sexy|.shop|.singles|.social|.solar|.stream|.sucks|.support|.tattoo|.tel|.today|.top|.travel|.ventures|.video|.voting|.wedding|.wiki|.win|.work|.wtf|.xxx|.XYZ|.kaufen|.desi|.shiksha|.moda|.futbol|.juegos|.uno|.africa|.asia|.krd|.taipei|.tokyo|.alsace|.amsterdam|.bcn|.barcelona|.berlin|.brussels|.bzh|.cat|.cymru|.eus|.frl|.gal|.gent|.irish|.istanbul|.istanbul|.london|.paris|.saarland|.scot|.swiss|.wales|.wien|.miami|.nyc|.quebec|.vegas|.kiwi|.melbourne|.sydney|.lat|.rio|.ru|.aaa|.abb|.aeg|.afl|.aig|.airtel|.bbc|.bentley|.example|.invalid|.local|.localhost|.onion|.testa)$
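One caveat with a hand-written alternation like this is that an unescaped `.` in a regex matches any character, not just a literal dot. A hedged alternative is to generate the pattern from a list and let `re.escape` handle the dots. The sketch below uses a tiny hypothetical `SUFFIXES` excerpt rather than the full list, and `build_domain_regex` and `registrable_domain` are illustrative names:

```python
import re

# Hypothetical short excerpt; the real list has thousands of entries.
SUFFIXES = ["co.in", "co.uk", "com", "in", "uk"]

def build_domain_regex(suffixes):
    """Compile label + known suffix, anchored at the end of the host."""
    # re.escape turns "." into "\." so it only matches a literal dot.
    alternation = "|".join(re.escape("." + s) for s in suffixes)
    return re.compile(r"[a-z0-9-]{1,63}(" + alternation + r")$")

PATTERN = build_domain_regex(SUFFIXES)

def registrable_domain(host):
    """Return the label plus suffix at the end of host, or None."""
    match = PATTERN.search(host.lower())
    return match.group(0) if match else None

print(registrable_domain("www.google.com"))     # google.com
print(registrable_domain("code.google.co.uk"))  # google.co.uk
print(registrable_domain("www.google.co.in"))   # google.co.in
```

Because `search` returns the leftmost match and the pattern is anchored with `$`, the longest known suffix is picked without any special ordering of the alternatives.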
creed