2

Well, I use the idna_convert PHP class (http://idnaconv.net/index.html) in order to encode / decode domain names.

Unfortunately, it doesn't seem to provide an interface to check whether a domain name is already punycode or not.

What's the best way to achieve this? It would be nice if someone could post source code showing how to verify whether a domain is punycode or not (with an explanation, because the idna_convert code is not really clear to me). I already know how to catch the exception from idna_convert. :-)

Btw.: idna_convert throws an exception when you try to convert a domain name to punycode that is already punycode (see https://github.com/phlylabs/idna-convert/blob/master/src/Punycode.php; line 157). Moreover, I do not really understand how their check works.

Andreas
  • Maybe try PHP's idn_to_utf8 function and compare the output with the input? http://php.net/manual/en/function.idn-to-utf8.php – Pavel Petrov May 13 '16 at 12:49
  • @PavelPetrov: Thanks, this function looks interesting and much better than catching an exception. :-) – Andreas May 13 '16 at 13:03
  • @Andreas but it can produce a wrong result, because Punycode conversion involves more than just converting to Unicode. Otherwise, idna_convert wouldn't be needed, you know. – Jehy May 13 '16 at 16:19
  • You could always just remove the offending code from the library. The code that raises the exception is completely unnecessary and should be removed. The decoder doesn't check to see if a domain is already decoded, so that library is internally inconsistent with itself. Or just write your own. There's some rather janky example C code in RFC3492 that's easy enough to port. [That's what I did](https://github.com/cubiclesoft/php-misc/blob/master/support/utf_utils.php). It's only ~300 lines of code to avoid an unnecessary dependency on a third party. – CubicleSoft May 23 '21 at 13:29
  • Punycode is an algorithm, not an attribute of a domain. A domain is either in LDH form, or not. If not, it is an IDN that needs to be converted by applying IDNA rules. If it is ASCII only and starts with `xn--` then it means it was an IDN converted – Patrick Mevzek Nov 17 '22 at 19:41

4 Answers

1

The simplest way is to convert it anyway and check whether the result equals the input.

EDIT: You can extend Punycode class with a check like this:

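class PunycodeCheck extends Punycode
{
    /**
     * Tells whether the input already starts with the Punycode prefix ("xn--").
     * This mirrors the check that encode() performs before it throws.
     *
     * @param array $decoded The input as an array of UCS-4 code points.
     * @return bool
     */
    public function check_encoded($decoded)
    {
        // Length of the prefix, i.e. how many leading code points to compare.
        $extract = self::byteLength(self::punycodePrefix);
        // The prefix converted to the same code-point array form as $decoded.
        $check_pref = $this->UnicodeTranscoder->utf8_ucs4array(self::punycodePrefix);
        // The first $extract code points of the input.
        $check_deco = array_slice($decoded, 0, $extract);

        return $check_pref == $check_deco;
    }
}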
Jehy
  • This is a good proposal, but unfortunately it does not work, because idna_convert throws an exception when the domain is already punycode and you try to encode it. See https://github.com/phlylabs/idna-convert/blob/master/src/Punycode.php (line 157). – Andreas May 13 '16 at 12:08
  • @Andreas then just catch this exception and check exception text - and you're okay! – Jehy May 13 '16 at 12:10
  • Yeah, but I don't think this is really a valid usage... I think it makes more sense (and is much more straightforward) to check upfront whether the domain needs to be encoded to punycode or already is. So I know that catching the exception is a way to solve the problem, but I don't really like that way... – Andreas May 13 '16 at 12:11
  • @Andreas The only valid way to check if domains need to be encoded is to try encoding. If you don't like catching the exception, you can extend the Punycode class with your own function that returns some special result for already encoded domains instead of throwing an exception. – Jehy May 13 '16 at 12:16
  • Well, I know, but this is exactly what I'm asking for? For an example code (with explanation) how to best verify that a domain is punycode or not. I know OOP and how to catch exceptions, extend classes, etc. ;-) – Andreas May 13 '16 at 12:18
  • @Andreas I added a sample function in my answer. – Jehy May 13 '16 at 12:28
1

It depends on what exactly you want.

As a first basic check, see if the domain name contains only ASCII characters. If yes, then the domain is "already punycode", in the sense that it can't be further transformed. For checking whether a string only contains ASCII characters, see Determine if UTF-8 text is all ASCII?.

If on top of that, you want to check whether the domain is in the IDN form, split the domain at the dots (.) and check whether any of the substrings starts with xn--.

If in addition to that, you want to check if the domain is IDN and is valid, just attempt to decode it with the library's decode function.
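
The first two checks can be sketched in plain PHP as follows. This is only a minimal illustration, not code from idna_convert: the function name looksLikeAceDomain is hypothetical, and it deliberately skips the third step (validity), which still requires actually decoding the labels.

```php
<?php
/**
 * Returns true if the domain is pure ASCII and at least one of its
 * labels starts with the ACE prefix "xn--".
 * Hypothetical helper; it does NOT validate the Punycode payload.
 *
 * @param string $domain
 * @return bool
 */
function looksLikeAceDomain($domain)
{
    // Step 1: any byte outside 0x00-0x7F means the name is not ASCII-only,
    // so it still has to be encoded.
    if (preg_match('/[^\x00-\x7F]/', $domain)) {
        return false;
    }

    // Step 2: split at the dots and look for the prefix on any label,
    // ignoring case.
    foreach (explode('.', $domain) as $label) {
        if (0 === strncasecmp($label, 'xn--', 4)) {
            return true;
        }
    }

    return false;
}

var_dump(looksLikeAceDomain('xn--80aj2abdcii9c.xn--p1ai')); // bool(true)
var_dump(looksLikeAceDomain('modulez.ru'));                 // bool(false)
var_dump(looksLikeAceDomain('престашоп.рф'));               // bool(false)
```

Note that a plain ASCII name such as modulez.ru returns false here, because it has no xn-- label, even though it does not need any encoding either.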

Community
1

Checking whether a domain is in Punycode is not entirely trivial. Several checks are needed, following the rules already described by @Wladston.

Below is adapted example code taken from the ValidateHelper class of my library, Helper classes for PrestaShop CMS. I have also added a test and the output of its run.

/**
 * Validate helper.
 *
 * @author Maksim T. <zapalm@yandex.com>
 */
class ValidateHelper
{
    /**
     * Checks if the given domain is in Punycode.
     *
     * @param string $domain The domain to check.
     *
     * @return bool Whether the domain is in Punycode.
     *
     * @see https://developer.mozilla.org/en-US/docs/Mozilla/Internationalized_domain_names_support_in_Mozilla#ASCII-compatible_encoding_.28ACE.29
     *
     * @author Maksim T. <zapalm@yandex.com>
     */
    public static function isPunycodeDomain($domain)
    {
        $hasPunycode = false;

        foreach (explode('.', $domain) as $part) {
            if (false === static::isAscii($part)) {
                return false;
            }

            if (static::isPunycode($part)) {
                $hasPunycode = true;
            }
        }

        return $hasPunycode;
    }

    /**
     * Checks if the given value is in ASCII character encoding.
     *
     * @param string $value The value to check.
     *
     * @return bool Whether the value is in ASCII character encoding.
     *
     * @see https://en.wikipedia.org/wiki/ASCII
     *
     * @author Maksim T. <zapalm@yandex.com>
     */
    public static function isAscii($value)
    {
        return ('ASCII' === mb_detect_encoding($value, 'ASCII', true));
    }

    /**
     * Checks if the given value is in Punycode.
     *
     * @param string $value The value to check.
     *
     * @return bool Whether the value is in Punycode.
     *
     * @throws \LogicException If the string is not encoded by UTF-8.
     *
     * @see https://en.wikipedia.org/wiki/Punycode
     *
     * @author Maksim T. <zapalm@yandex.com>
     */
    public static function isPunycode($value)
    {
        if (false === static::isAscii($value)) {
            return false;
        }

        if ('UTF-8' !== mb_detect_encoding($value, 'UTF-8', true)) {
            throw new \LogicException('The string should be encoded by UTF-8 to do the right check.');
        }

        return (0 === mb_stripos($value, 'xn--', 0, 'UTF-8'));
    }
}

/**
 * Test Punycode domain validator.
 *
 * @author Maksim T. <zapalm@yandex.com>
 */
class Test
{
    /**
     * Run the test.
     *
     * @author Maksim T. <zapalm@yandex.com>
     */
    public static function run()
    {
        $domains = [
            // White list
            'почта@престашоп.рф'          => false, // Russian, Unicode
            'modulez.ru'                  => false, // English, ASCII
            'xn--80aj2abdcii9c.xn--p1ai'  => true,  // Russian, ASCII
            'xn--80a1acn3a.xn--j1amh'     => true,  // Ukrainian, ASCII
            'xn--srensen-90a.example.com' => true,  // German, ASCII
            'xn--mxahbxey0c.xn--xxaf0a'   => true,  // Greek, ASCII
            'xn--fsqu00a.xn--4rr70v'      => true,  // Chinese, ASCII

            // Black List
            'xn--престашоп.xn--рф'        => false, // Russian, Unicode
            'xn--prestashop.рф'           => false, // Russian, Unicode
        ];

        foreach ($domains as $domain => $isPunycode) {
            echo 'TEST: ' . $domain . (ValidateHelper::isPunycodeDomain($domain)
                ? ' is in Punycode [' . ($isPunycode ? 'OK' : 'FAIL') . ']'
                : ' is NOT in Punycode [' . (false === $isPunycode ? 'OK' : 'FAIL') . ']'
            ) . PHP_EOL;
        }
    }
}

Test::run();

// The output result:
//
// TEST: почта@престашоп.рф is NOT in Punycode [OK]
// TEST: modulez.ru is NOT in Punycode [OK]
// TEST: xn--80aj2abdcii9c.xn--p1ai is in Punycode [OK]
// TEST: xn--80a1acn3a.xn--j1amh is in Punycode [OK]
// TEST: xn--srensen-90a.example.com is in Punycode [OK]
// TEST: xn--mxahbxey0c.xn--xxaf0a is in Punycode [OK]
// TEST: xn--fsqu00a.xn--4rr70v is in Punycode [OK]
// TEST: xn--престашоп.xn--рф is NOT in Punycode [OK]
// TEST: xn--prestashop.рф is NOT in Punycode [OK]
Maksim T.
0

The only exception that the encode() method throws is when the domain is already punycode. So you can do the following:

try {
    $punycode->encode($decoded);
} catch (\InvalidArgumentException $e) {
    //do whatever is needed when already punycode
    //or do nothing
}

However, this is only a workaround.
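
If you go this way, the try/catch can be hidden in a small helper so the calling code stays clean. The sketch below assumes only what is stated above, namely that the encoder throws \InvalidArgumentException for input that is already encoded. The name encodeOrKeep and the stand-in encoder are hypothetical and exist purely for demonstration; they are not part of idna_convert.

```php
<?php
/**
 * Runs the given encoder on $input and returns the input unchanged if the
 * encoder rejects it with \InvalidArgumentException.
 *
 * @param callable $encode Hypothetical wrapper around the library's encode call.
 * @param string   $input  The domain to encode.
 * @return string
 */
function encodeOrKeep(callable $encode, $input)
{
    try {
        return $encode($input);
    } catch (\InvalidArgumentException $e) {
        // Assumption: the only reason for this exception is "already encoded".
        return $input;
    }
}

// Stand-in encoder for demonstration only; NOT real Punycode encoding.
$fakeEncode = function ($value) {
    if (0 === strpos($value, 'xn--')) {
        throw new \InvalidArgumentException('This is already a Punycode string');
    }

    return '[encoded] ' . $value; // placeholder result
};

echo encodeOrKeep($fakeEncode, 'xn--p1ai'), PHP_EOL; // xn--p1ai
echo encodeOrKeep($fakeEncode, 'example'), PHP_EOL;  // [encoded] example
```

The drawback is the same as with the bare try/catch: any other \InvalidArgumentException would be swallowed as well.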

Pavel Petrov
  • I agree, but in my opinion it would be much better to check upfront if the domain is already punycode or not. Catching the InvalidArgumentException seems rather.... well, dirty. – Andreas May 13 '16 at 12:25
  • I agree, it's just the first thing that came to mind to solve the problem. – Pavel Petrov May 13 '16 at 12:27