The Latin script is used by many languages, and I would like to verify that input characters belong to a specific language (e.g. English or German), not just to the Latin script.
Unicode is divided into blocks, and blocks are not necessarily language-specific. Languages of the Americas and Europe rely on the Basic Latin and Latin-1 Supplement blocks, but the Latin-1 Supplement block mixes French accented characters (such as é) with German ones (such as ä and ß). So if I want only French characters, do I have to build my own array of legitimate characters, or is there an existing resource for that (and for all other languages)?
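For reference, the hand-built approach I would like to avoid looks roughly like the sketch below. The letter list and the names FRENCH_LETTERS and isFrenchWord() are my own assumptions for illustration, not an authoritative French alphabet, and the code assumes UTF-8 source and input with the mbstring extension available.

```php
<?php
// Hypothetical allowlist of lowercase French letters (an assumption for
// illustration; not an authoritative or complete French alphabet).
const FRENCH_LETTERS = 'abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ';

// Hypothetical helper: true if every character of $text is in the allowlist.
function isFrenchWord(string $text): bool
{
    if ($text === '') {
        return false;
    }
    // Build a lookup table keyed by character.
    $allowed = array_flip(mb_str_split(FRENCH_LETTERS, 1, 'UTF-8'));
    // Compare case-insensitively by lowercasing the input first.
    foreach (mb_str_split(mb_strtolower($text, 'UTF-8'), 1, 'UTF-8') as $char) {
        if (!isset($allowed[$char])) {
            return false;
        }
    }
    return true;
}

var_dump(isFrenchWord('Élève'));  // bool(true)
var_dump(isFrenchWord('Straße')); // bool(false): ß is not in the list
```

This works for one language, but it means maintaining one such list per language by hand, which is exactly what I am hoping a standard resource already provides.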
The IntlChar class gets closer but does not solve this problem. It can report the Unicode block of each parsed character (via IntlChar::getBlockCode()), but it is not locale-aware. It would be nice if it were, since the locale string would specify a language and perhaps give more precision. I know IntlChar is based on the ICU library, so PHP is unlikely to change its implementation.
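To show why the block alone is not enough, here is a minimal sketch, assuming the intl extension is loaded. The variable names are mine; the point is only that a French letter and a German letter report the same block.

```php
<?php
// Requires the intl extension. Both accented characters report the same
// block, so the block alone cannot separate French from German.
$eAcute = "\u{00E9}"; // é, used in French
$eszett = "\u{00DF}"; // ß, used in German

var_dump(IntlChar::getBlockCode($eAcute) === IntlChar::BLOCK_CODE_LATIN_1_SUPPLEMENT); // bool(true)
var_dump(IntlChar::getBlockCode($eszett) === IntlChar::BLOCK_CODE_LATIN_1_SUPPLEMENT); // bool(true)
var_dump(IntlChar::getBlockCode('e') === IntlChar::BLOCK_CODE_BASIC_LATIN);            // bool(true)
```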
<?php
use PHPUnit\Framework\TestCase;

class CharacterTest extends TestCase
{
    public function testFrenchCharacter(): void
    {
        $e_with_acute = "\u{00E9}"; // é as a two-byte UTF-8 string
        $snowman = "\u{2603}";      // ☃, a symbol rather than a letter

        $this->assertFalse(ctype_alpha($e_with_acute));

        // Try POSIX-style locale names first, then the original 'fr-FR' spelling
        setlocale(LC_CTYPE, 'fr_FR.UTF-8', 'fr_FR', 'fr-FR');
        // ctype_alpha() tests individual bytes, so a multibyte UTF-8
        // character is still rejected under a French locale
        $this->assertFalse(ctype_alpha($e_with_acute));

        // \IntlChar::isalpha() is not locale-aware either, but it handles Unicode characters
        $this->assertFalse(\IntlChar::isalpha($snowman));
        $this->assertEquals(\IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER, \IntlChar::charType($e_with_acute));
        $this->assertTrue(\IntlChar::isalpha($e_with_acute));
    }
}