Does PHP offer a way to determine if a Unicode codepoint belongs to a particular language, not just a particular script?

Question

The Latin script supports many languages, and I would like to make sure that input characters are within a language (e.g. English or German), not just within the Latin script.

Unicode is divided into blocks, and blocks are not necessarily language-specific. America and Europe use the Basic Latin and Latin-1 Supplement blocks, but within the Latin-1 Supplement block, accented French characters are mixed with accented German characters. So if I want just French characters, do I have to construct my own array of legitimate characters, or is there a resource somewhere for that (and for all other languages)?
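
A minimal sketch of the hand-built approach, assuming one allowlist string per language. The constant `LANGUAGE_LETTERS`, the function `isInLanguage()`, and the letter sets themselves are hypothetical and only illustrative; a real list would need to be checked against an authoritative source.

```php
<?php
// Hypothetical, hand-maintained allowlists (lowercase letters only).
// The exact letter sets are assumptions made for illustration.
const LANGUAGE_LETTERS = [
    'fr' => 'abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ',
    'de' => 'abcdefghijklmnopqrstuvwxyzäöüß',
];

// Returns true when every character of $text is in the allowlist for $language.
function isInLanguage(string $text, string $language): bool
{
    if (!isset(LANGUAGE_LETTERS[$language])) {
        throw new InvalidArgumentException("No allowlist for '$language'");
    }

    // Turn the allowlist into a character class; /u enables UTF-8 mode and
    // /i makes the match case-insensitive using PCRE's case folding.
    $class = preg_quote(LANGUAGE_LETTERS[$language], '/');

    return preg_match('/\A[' . $class . ']+\z/iu', $text) === 1;
}

var_dump(isInLanguage('été', 'fr'));
var_dump(isInLanguage('Straße', 'fr'));
```

This only covers letters: spaces, digits, apostrophes, and hyphens would have to be added to each allowlist if the input can contain them.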

The IntlChar class gets closer but does not solve this problem. You can obtain the Unicode block as a property from each character that is parsed. But it would be nice if IntlChar were locale-aware, since the locale string would specify a language and perhaps give more precision. I know IntlChar is based on an ICU library, and so the PHP language is unlikely to change its implementation.
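
For comparison, a script-level check shows the same limitation. The sketch below uses PCRE's `\p{Latin}` script property rather than IntlChar; `isLatinScript()` is a hypothetical helper name. It accepts any Latin-script letter, whichever language it comes from.

```php
<?php
// Script membership check via PCRE's Unicode script property.
// The /u modifier is required so the pattern and subject are treated as UTF-8.
function isLatinScript(string $char): bool
{
    return preg_match('/\A\p{Latin}\z/u', $char) === 1;
}

$samples = [
    "\u{00E9}", // é, used in French, Portuguese, Czech, ...
    "\u{00DF}", // ß, used in German
    "\u{017E}", // ž, used in Lithuanian, Czech, ...
    "\u{2603}", // snowman, not a letter at all
];

foreach ($samples as $char) {
    printf("%s => %s\n", $char, isLatinScript($char) ? 'Latin' : 'not Latin');
}
```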

use PHPUnit\Framework\TestCase;

class CharacterTest extends TestCase {

    public function testFrenchCharacter(): void {
        $e_with_acute = "\u{00E9}";
        $snowman = "\u{2603}";

        $this->assertFalse(ctype_alpha($e_with_acute));

        setlocale(LC_CTYPE, 'fr-FR');

        // ctype_alpha works byte by byte, so the two-byte UTF-8 encoding of é
        // is still not recognized here after setting a locale
        $this->assertFalse(ctype_alpha($e_with_acute));

        // \IntlChar::isalpha is not locale aware either but handles Unicode characters
        $this->assertFalse(\IntlChar::isalpha($snowman));

        $this->assertEquals(\IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER, \IntlChar::charType($e_with_acute));

        $this->assertTrue(\IntlChar::isalpha($e_with_acute));

    }

}

There's a problem with this: a character can belong to multiple languages - `é [LATIN SMALL LETTER E WITH ACUTE]` could be French, or any Slavic language, or maybe Portuguese. — Piskvor left the building, Jul 03 '19 at 14:19
This is a risqué proposition and opens up a smörgåsbord of potential issues. — See what I did there? This was "English"… ;) — deceze, Jul 03 '19 at 15:16
Your best bet is to define your own valid character set and validate against it. A partial solution is to check Unicode codepoint ranges with a regex. For example, I would use `preg_match('#[a-žA-Ž]#u', $character)` for a simple test of whether a character belongs to the Lithuanian language. The problem with this approach is that Lithuanian letters are scattered across different Unicode blocks, so this regex will also match other characters in the same `0x0061 - 0x017E` range, such as `µ¶¿¾`. But if high precision is not your concern, this may be OK. — Agnius Vasiliauskas, Jul 03 '19 at 15:47
In response to Piskvor's comment, I don't see a problem with a one-to-many relationship between codepoints and languages (because that is obviously the case, as Piskvor points out!). To deceze's point, I get it (clever reply!). But I think there could be times when you would ideally want text input restricted to a particular language. Based on these responses, it sounds like a regex or an array of characters is the only way to accomplish it. And obviously that creates a lot of work if you intend to support multiple languages in your app. — Doug Wilbourne, Jul 04 '19 at 19:49
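
The over-matching described in the comments can be seen in a short sketch. It compares the range-based class from the comment with an explicit class listing the 32 letters of the Lithuanian alphabet in both cases. The function names `matchesRange()` and `isLithuanianLetter()` are hypothetical.

```php
<?php
// Range-based class from the comment: it spans U+0041..U+017D and
// U+0061..U+017E, so it also accepts symbols and letters of other languages.
function matchesRange(string $char): bool
{
    return preg_match('#\A[a-žA-Ž]\z#u', $char) === 1;
}

// Explicit class: only the Lithuanian letters, lowercase and uppercase.
function isLithuanianLetter(string $char): bool
{
    $letters = 'aąbcčdeęėfghiįyjklmnoprsštuųūvzž'
             . 'AĄBCČDEĘĖFGHIĮYJKLMNOPRSŠTUŲŪVZŽ';

    return preg_match('#\A[' . $letters . ']\z#u', $char) === 1;
}

// µ (U+00B5) and q fall inside the range but are not Lithuanian letters.
foreach (["\u{00B5}", 'q', 'ž', 'Ė'] as $char) {
    printf(
        "%s range=%s explicit=%s\n",
        $char,
        var_export(matchesRange($char), true),
        var_export(isLithuanianLetter($char), true)
    );
}
```

The explicit class costs more to write and maintain, but it does not depend on how the letters happen to be laid out across Unicode blocks.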

