1

I'm trying to extract the subdomain from the HTTP_HOST value. However, I've run into a problem: if the subdomain has more than one dot in it, it fails to match properly. Given that this script has to run on multiple different domains, that the subdomain could contain any number of dots, and that the TLD could be either 1 or 2 parts (and any length), is there a practical way of correctly matching the subdomain, domain and TLD in all situations?

For example, take the following HTTP_HOST values and what needs to be matched for each.

  • www.buggedcom.co.uk
    • Subdomain: www
    • Domain: buggedcom.co.uk
    • TLD: co.uk
  • www.buggedcom.com
    • Subdomain: www
    • Domain: buggedcom.com
    • TLD: com
  • test.buggedcom.co.uk
    • Subdomain: test
    • Domain: buggedcom.co.uk
    • TLD: co.uk
  • test.buggedcom.com
    • Subdomain: test
    • Domain: buggedcom.com
    • TLD: com
  • multi.sub.test.buggedcom.co.uk
    • Subdomain: multi.sub.test
    • Domain: buggedcom.co.uk
    • TLD: co.uk
  • multi.sub.test.buggedcom.com
    • Subdomain: multi.sub.test
    • Domain: buggedcom.com
    • TLD: com

I presume the only way to accomplish this would be to load a list of TLDs, which, although possible, I don't really want to do, as this runs at the start of a script and shouldn't really require heavy lifting like that.
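For illustration, the "list of TLDs" idea does not have to mean loading a full list. Below is a minimal sketch that assumes a short, hard-coded array of the two-part suffixes a given installation actually serves. The function name `split_host` and the sample suffix list are hypothetical, the list is not complete, and IP addresses (such as a SERVER_ADDR fallback) are not handled.

```php
<?php
// Hypothetical helper: split a host into subdomain, domain and TLD using a
// small hard-coded list of two-part suffixes. The default list is only an
// illustrative sample, not a complete list of suffixes.
function split_host($host, array $multi_part_tlds = array('co.uk', 'org.uk', 'com.au'))
{
    // Drop an optional port and normalise case.
    $host = strtolower(preg_replace('/:\d+$/', '', $host));
    $labels = explode('.', $host);
    $count = count($labels);

    // Assume a one-label TLD unless the last two labels form a known suffix.
    $tld_length = 1;
    if ($count >= 2 && in_array(implode('.', array_slice($labels, -2)), $multi_part_tlds, true)) {
        $tld_length = 2;
    }

    // Not enough labels for a domain plus TLD (e.g. "localhost").
    if ($count <= $tld_length) {
        return array('subdomain' => '', 'domain' => $host, 'tld' => '');
    }

    $tld       = implode('.', array_slice($labels, -$tld_length));
    $name      = $labels[$count - $tld_length - 1];
    $subdomain = implode('.', array_slice($labels, 0, $count - $tld_length - 1));

    return array('subdomain' => $subdomain, 'domain' => $name . '.' . $tld, 'tld' => $tld);
}
```

With this sketch, `split_host('multi.sub.test.buggedcom.co.uk')` yields the subdomain `multi.sub.test`, the domain `buggedcom.co.uk` and the TLD `co.uk`. Any two-part suffix missing from the array is treated as a one-part TLD, which is the trade-off for avoiding a full list.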

Below is the current code.

define('HOST', isset($_SERVER['HTTP_HOST']) === true ? $_SERVER['HTTP_HOST'] : (isset($_SERVER['SERVER_ADDR']) === true ? $_SERVER['SERVER_ADDR'] : $_SERVER['SERVER_NAME']));
$domain_parts = explode('.', HOST); 
$domain_parts_count = count($domain_parts);
if($domain_parts_count > 1)
{   
    $sub_parts = array_splice($domain_parts, 0, $domain_parts_count-3);
    define('SUBDOMAIN', implode('.', $sub_parts));
    unset($sub_parts);
}
else
{
    define('SUBDOMAIN', '');
}
define('DOMAIN', implode('.', $domain_parts));
var_dump($domain_parts, SUBDOMAIN, DOMAIN);exit;

Just a thought: could mod_rewrite append the subdomain as a GET parameter?

buggedcom
  • If the site was aware of its proper domain (in this case, "buggedcom") this would be trivial. Is there no way to require this in some sort of application configuration file? – bzlm Aug 05 '10 at 13:23
  • The cms has a multi site architecture. The actual site url is loaded out of the database further down the configuration and it is based on the host only. I suppose the subdomain/tld definitions could be moved further down the page. – buggedcom Aug 05 '10 at 13:47

4 Answers

1

First of all, I would explode the string on a slash and use the first element of the resulting array, just to be sure that the string ends with the TLD.

Then I would cut it with a preg_replace. This regexp matches the domain+TLD regardless of TLD type. Beware, however, that it has a problem with two- and three-letter domain names: in www.abc.com, for instance, abc.com looks just like a two-part TLD such as co.uk. But it should give a push in the right direction...

[a-zA-Z0-9]+\.(([a-zA-Z]{2,6})|([a-zA-Z]{2,3}\.[a-zA-Z]{2,3}))$

Edit: as pointed out, .museum is also possible, so I edited the first pattern in the TLD part.

And of course TLDs like .uk could behave differently than co.uk. Ugh... it's not that easy.
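As a rough sketch of how the pattern above could be applied (using preg_match with an offset capture rather than preg_replace, which is a swapped-in choice, and a hypothetical helper name `split_by_pattern`):

```php
<?php
// The pattern from this answer, wrapped in delimiters for PCRE.
$pattern = '/[a-zA-Z0-9]+\.(([a-zA-Z]{2,6})|([a-zA-Z]{2,3}\.[a-zA-Z]{2,3}))$/';

// Hypothetical helper: returns array(subdomain, domain) or false if no match.
function split_by_pattern($host, $pattern)
{
    // Keep only the part before the first slash, so the string ends with the TLD.
    $parts = explode('/', $host);
    $host = $parts[0];

    if (preg_match($pattern, $host, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return false;
    }

    // $m[0] holds the matched domain+TLD and its byte offset in $host.
    $domain = $m[0][0];
    // Everything before the match, minus the separating dot, is the subdomain.
    $subdomain = rtrim(substr($host, 0, $m[0][1]), '.');

    return array($subdomain, $domain);
}
```

For `multi.sub.test.buggedcom.co.uk` this returns `multi.sub.test` and `buggedcom.co.uk`. The caveat about short names shows up with `www.abc.com`: the whole string matches as the domain and the subdomain comes back empty, because `abc.com` fits the two-part TLD alternative.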

Deefjuh
1

I think this is better handled by borrowing from others who have tried to do the same thing. There are a number of URL parsing functions in the comments on the PHP docs page for parse_url that might work better: http://www.php.net/manual/en/function.parse-url.php
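For context, parse_url itself only isolates the host; it does not split it into subdomain, domain and TLD. A minimal sketch, assuming the HTTP_HOST value may carry a port (the helper name `host_without_port` is hypothetical):

```php
<?php
// Hypothetical helper: strip a port (and anything else) from an HTTP_HOST
// value by letting parse_url() pick out the host component.
function host_without_port($http_host)
{
    // A scheme is prepended so that parse_url() recognises the host part.
    $host = parse_url('http://' . $http_host, PHP_URL_HOST);

    // parse_url() returns false on seriously malformed input and null when
    // the requested component is missing; fall back to an empty string.
    return is_string($host) ? $host : '';
}
```

The result, for example `multi.sub.test.buggedcom.co.uk` from `multi.sub.test.buggedcom.co.uk:8080`, still has to be split by some other means.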

Tony
0

With preg_match, you can extract the subdomain and tld parts in one go, like this:

function get_domain_parts($domain) {
    $parts = array();
    $pattern = "/(.*)\.buggedcom\.(.*)/";
    if (preg_match($pattern, $domain, $parts) == 1) {
        return array($parts[1], $parts[2]);
    } else {
        return FALSE;
    }
}

$result = get_domain_parts("multi.sub.test.buggedcom.co.uk");
if ($result) {
    echo($result[0] . " and " . $result[1]); // multi.sub.test and co.uk   
}
André Laszlo
  • because this won't be run on a definitive domain so I can't check against anything. Also it's run before configuration loads in the base url for various optimization/caching reasons. – buggedcom Aug 05 '10 at 13:20
  • oic, I guess you'll have to go with evolve's solution then :) – André Laszlo Aug 05 '10 at 13:25
0

Not to be nit-picky, but technically speaking .co.uk is a second level domain.

.uk is the "Country Code Top Level Domain" in that case, and the .co is for "Commercial Use" defined by the United Kingdom.

This might not answer your question though.

Wikipedia has a pretty complete list of TLDs. As you can see, they only contain 1 "dot" followed by 1 "string".

tplaner