Case

It seems that Spoofchecker from the Intl extension yields false positives:

<?php  // 7.0 on linux
// File encoding of this script is UTF-8 (thus without BOM)
$sDefaultLocale = \Locale::getDefault();  // getDefault() is static
$oSpoofchecker = new \Spoofchecker;
$oSpoofchecker->setAllowedLocales($sDefaultLocale);
$sText = 'abc';  // US-ASCII
header('Content-Type: text/plain');
print
    'Default locale: ' . $sDefaultLocale . PHP_EOL
  . 'Byte length: ' . strlen($sText) . PHP_EOL  // US-ASCII check
  . 'Text "' . $sText . '" '
  . ($oSpoofchecker->isSuspicious($sText, $sError) ? 'IS' : 'IS NOT')
  . ' suspicious' . PHP_EOL
  . 'Spoofchecker internal error information:' . PHP_EOL;
var_dump($sError);

Results

Default locale: en_US_POSIX
Byte length: 3
Text "abc" IS suspicious
Spoofchecker internal error information:
NULL    

Expected results

Text "abc" IS NOT suspicious

This is because abc is US-ASCII, which presumably should be allowed by default for en_US_POSIX. Also, the PHP Spoofchecker class documentation mentions that Spoofchecker::isSuspicious() returns TRUE if any non-English characters are used, which is not the case here.

Possible causes

The documentation of Spoofchecker::setAllowedLocales() is currently close to non-existent; it does not list the possible argument values. One can only assume that they must be compatible with those of Locale, whose documentation reads:

Locales are identified using RFC 4646 language tags (which use hyphen, not underscore)

This contradicts the test result, where Locale uses underscores rather than hyphens for the default locale. But when running another test with $oSpoofchecker->setAllowedLocales('en-US'); the results stay the same.

Question

How do I use Spoofchecker::isSuspicious() properly?

Code4R7
  • Why is 'abc' US-ASCII if your file encoding is UTF-8? Have you made sure it is **really** UTF-8 – Xatenev Apr 06 '18 at 07:42
  • US-ASCII is a subset of UTF-8. To double check I replaced `$sText = 'abc';` with `$sText = chr(97) . chr(98) . chr(99);` which unfortunately did not change the result. I also double checked visually with a UTF-8 string and checked the settings of Eclipse. – Code4R7 Apr 06 '18 at 07:48

2 Answers

PHP's Intl extension is just a wrapper around ICU, whose Spoofchecker received a reduction in false positives starting with ICU version 58.

From their bug tracker:

ICU 58 reflects the latest Unicode update, which deprecates the Whole-Script Confusables (WSC) check and Mixed-Script Confusables (MSC) check and is available at http://www.unicode.org/L2/L2016/16229-revising-uts-39-algorithm.pdf.

Under ICU 57, the checks (WSC and MSC) had the following pitfalls:

  1. They did not restrict themselves to the set of characters specified by SpoofChecker#setAllowedChars or SpoofChecker#setAllowedLocales.
  2. They did not correctly handle confusables containing multiple skeleton characters, like 'æ' to 'ae'.
  3. WSC exhibited a high false-positive rate, especially as more and more entries were being added to confusables.txt.
  4. All strings failing MSC also fail Restriction Level. (Your string, "goօgle", is an example.)

With these pitfalls in mind, WSC and MSC were removed from ICU 58.

Emphasis mine. The WSC check is what your string is failing. (Note that it passes where the ICU version is 58.1 or later, as that check has been removed entirely.)
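To see which side of that threshold an installation is on, something like the following could be used. It is only a sketch: INTL_ICU_VERSION is the constant the intl extension defines, while icuHasLegacyConfusableChecks() is a hypothetical helper name, and the 58.1 cut-off is simply the version mentioned above.

```php
<?php
// Hypothetical helper (not part of intl): true when the given ICU version
// predates 58.1, the first release noted above as passing the 'abc' case.
function icuHasLegacyConfusableChecks(string $icuVersion): bool
{
    return version_compare($icuVersion, '58.1', '<');
}

// INTL_ICU_VERSION is only defined when the intl extension is loaded.
if (defined('INTL_ICU_VERSION')) {
    echo 'ICU version: ', INTL_ICU_VERSION, PHP_EOL;
    echo icuHasLegacyConfusableChecks(INTL_ICU_VERSION)
        ? 'WSC/MSC checks are still present; expect false positives.'
        : 'WSC/MSC checks have been removed.', PHP_EOL;
} else {
    echo 'The intl extension is not loaded.', PHP_EOL;
}
```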

As to how to use Spoofchecker::isSuspicious() properly:

  1. Upgrade ICU (which is a good idea in general) or
  2. Use Spoofchecker::setChecks() as noted in Syscall's answer and omit the WSC check Spoofchecker::WHOLE_SCRIPT_CONFUSABLE (which covers this case) and the MSC check Spoofchecker::MIXED_SCRIPT_CONFUSABLE (which is likewise removed from recent versions.)
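A minimal sketch of option 2, assuming that CHAR_LIMIT and INVISIBLE are the only checks you want to keep (that choice is an assumption, adjust the mask as needed). makeChecker() is a hypothetical helper name, and 'en_US' is used instead of the default locale so the result does not depend on the environment.

```php
<?php
// Hypothetical helper: builds a checker restricted to CHAR_LIMIT and INVISIBLE,
// thereby leaving out WHOLE_SCRIPT_CONFUSABLE and MIXED_SCRIPT_CONFUSABLE.
function makeChecker(string $locales): Spoofchecker
{
    $checker = new Spoofchecker();
    $checker->setAllowedLocales($locales);
    // setChecks() replaces the whole mask, so list every check to keep.
    $checker->setChecks(Spoofchecker::CHAR_LIMIT | Spoofchecker::INVISIBLE);
    return $checker;
}

$checker = makeChecker('en_US');
// The second string starts with CYRILLIC CAPITAL LETTER ER (U+0420).
foreach (['abc', "\u{0420}aypal.com"] as $text) {
    printf(
        'Text "%s" %s suspicious%s',
        $text,
        $checker->isSuspicious($text) ? 'IS' : 'IS NOT',
        PHP_EOL
    );
}
```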
user3942918

You could use Spoofchecker::setChecks(int $checks) to specify how the string will be verified.

The $checks constants are listed in the Spoofchecker class documentation, and described by a user in comments.

You can use Spoofchecker::CHAR_LIMIT (or a combination of multiple constants, e.g. Spoofchecker::CHAR_LIMIT|Spoofchecker::INVISIBLE):

CHAR_LIMIT: Check that an identifier contains only characters from a specified set of acceptable characters.
INVISIBLE: Check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark.

$sDefaultLocale = \Locale::getDefault();  // getDefault() is static
$oSpoofchecker = new \Spoofchecker;
$oSpoofchecker->setAllowedLocales($sDefaultLocale);
$oSpoofchecker->setChecks(\Spoofchecker::CHAR_LIMIT);
$sText = 'abc';  // US-ASCII
header('Content-Type: text/plain');
print
    'Default locale: ' . $sDefaultLocale . PHP_EOL
  . 'Byte length: ' . strlen($sText) . PHP_EOL  // US-ASCII check
  . 'Text "' . $sText . '" '
  . ($oSpoofchecker->isSuspicious($sText, $sError) ? 'IS' : 'IS NOT')
  . ' suspicious' . PHP_EOL
  . 'Spoofchecker internal error information:' . PHP_EOL;
var_dump($sError);

This outputs:

Default locale: en_US_POSIX
Byte length: 3
Text "abc" IS NOT suspicious
Spoofchecker internal error information:
NULL

With the example from the isSuspicious() documentation, the text Рaypal.com (whose first letter is Cyrillic), the method returns:

Text "Рaypal.com" IS suspicious 
Syscall