43

If I have a PHP string, how can I determine if it contains at least one non-ASCII character or not, in an efficient way? And by non-ASCII character, I mean any character that is not part of this table, http://www.asciitable.com/, positions 32 - 126 inclusive.

So not only does each character have to be part of the ASCII table, it also has to be printable. I want to detect a string that contains at least one character that does not meet these requirements (either a non-printable ASCII character, or a different character altogether, such as a Unicode character that is not part of that table).

rid
  • So you do not mean Unicode, but non-`US-ASCII`? I think this is worth specifying if you're looking for something efficient. – hakre Jun 27 '11 at 19:14
  • Can you make any safe assumption about the string, such as encoding? – Álvaro González Jun 27 '11 at 19:19
  • 3
    All ASCII characters are <= 127, and any UTF-8 character sequence that decodes to a non-ASCII character has at least one byte with the highest bit set. Thus, if you have no byte >127, it's ASCII. Detecting UTF-8 encoding as suggested in the answers below will probably work too, but could possibly be ambiguous (since ASCII characters are incidentally _also_ UTF-8 characters). – Damon Jun 27 '11 at 19:22
  • 1
    Similar to http://stackoverflow.com/questions/4147646/determine-if-utf-8-text-is-all-ascii – Gras Double Jan 07 '15 at 17:30

8 Answers

76

I found it more useful to detect whether any character falls outside the allowed range:

if(preg_match('/[^\x20-\x7e]/', $string))
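
A minimal sketch of how that check could be wrapped into a reusable helper. The function name `has_non_printable_ascii` is hypothetical (not part of PHP). The pattern is the one above, applied without the `u` modifier, so it works byte by byte and any byte of a multi-byte UTF-8 sequence counts as a match.

```php
<?php
// Hypothetical helper: true if $string contains any byte outside 0x20-0x7E.
function has_non_printable_ascii(string $string): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match('/[^\x20-\x7e]/', $string) === 1;
}

var_dump(has_non_printable_ascii('Hello, world!'));  // bool(false)
var_dump(has_non_printable_ascii("tab\there"));      // bool(true): control character
var_dump(has_non_printable_ascii("caf\xc3\xa9"));    // bool(true): UTF-8 bytes of "é"
var_dump(has_non_printable_ascii(''));               // bool(false): nothing to reject
```

Note that the empty string passes this check; whether that is desirable depends on the caller.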
BenMorel
Karolis
  • This answer is good but you can find more solutions in this post http://stackoverflow.com/questions/4147646/determine-if-utf-8-text-is-all-ascii – ElSinus Nov 04 '15 at 08:18
  • @wheresrhys I think your snippets tests if all characters in string are ascii, for any character code should be `/[^\x20-\x7f]/.test(theString)` – Igor Jerosimić Jul 27 '18 at 13:19
44

You can use mb_detect_encoding and check for ASCII:

mb_detect_encoding($str, 'ASCII', true)

This will return false if $str contains at least one non-ASCII character (byte value > 0x7F).
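
A 7-bit check of this kind is weaker than what the question asks for: control characters such as tab or newline are ASCII but not printable. The following sketch contrasts the two checks. It assumes the mbstring extension is loaded, and the `$samples` array is purely illustrative data.

```php
<?php
// Assumes the mbstring extension is available.
$samples = [
    'plain'  => 'Hello',
    'tab'    => "a\tb",         // ASCII control character
    'accent' => "caf\xc3\xa9",  // UTF-8 bytes for "café"
];

foreach ($samples as $label => $str) {
    // Strict detection against ASCII only: returns 'ASCII' or false.
    $is7bit = mb_detect_encoding($str, 'ASCII', true) !== false;
    // Printable range 32-126 only, checked byte by byte.
    $isPrintable = !preg_match('/[^\x20-\x7e]/', $str);
    printf(
        "%-7s 7-bit: %-5s printable: %s\n",
        $label,
        var_export($is7bit, true),
        var_export($isPrintable, true)
    );
}
```

The `tab` sample is where the two checks should disagree: it is valid 7-bit ASCII but falls outside the printable range.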

Gumbo
  • 20
    [`mb_check_encoding`](http://php.net/manual/en/function.mb-check-encoding.php) would be more appropriate: `mb_check_encoding($str, 'ASCII')` – Gras Double Jan 07 '15 at 17:26
4

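The function ctype_print returns true if and only if all characters fall into the ASCII range 32-126 (PHP unit test). This holds in the default "C" locale; the function is locale-dependent, so other locales may treat additional bytes as printable.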
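
Because the PHP manual describes `ctype_print` as depending on the current locale, differing results across environments are plausible. A hedged sketch that pins `LC_CTYPE` to the `C` locale for the duration of the check; the wrapper name `is_printable_ascii` is hypothetical, and the ctype extension is assumed to be enabled.

```php
<?php
// Hypothetical wrapper: run ctype_print with LC_CTYPE pinned to "C".
function is_printable_ascii(string $text): bool
{
    // Passing "0" only queries the current locale without changing it.
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'C');

    $result = ctype_print($text);

    // Restore whatever locale was active before the check.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }
    return $result;
}

var_dump(is_printable_ascii('Hello, world!')); // bool(true)
var_dump(is_printable_ascii("line\nbreak"));   // bool(false)
var_dump(is_printable_ascii("\xa0"));          // bool(false)
var_dump(is_printable_ascii(''));              // bool(false): ctype_print rejects empty strings
```

Unlike a negated regex, this rejects the empty string, which callers may need to handle separately.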

Steffen
  • `php -r 'echo ctype_print("\xa0");'` prints `1` so there's something fishy with this function. – forthrin Aug 29 '16 at 11:47
  • @forthrin: I cannot confirm that. For me, `php -r 'var_dump(ctype_print("\xa0"));'` returns false (using PHP 7.0.10). – Steffen Aug 30 '16 at 13:00
  • I'm on PHP 7.0.10 too, Homebrew version (OS X). Could the difference be caused by terminal, locale, php.ini or other environmental factors? – forthrin Aug 30 '16 at 17:49
  • Doesn't work for me either, PHP 7.0.5 on Windows - no idea why. It does not seem to work anymore. We should probably open a bug report? – mindplay.dk Sep 16 '16 at 13:04
4

I benchmarked the suggested functions as I need this check for batch processing of shorter (1000 characters max) strings. I tested 10k iterations of 30 different strings (empty, short, longer, ascii, accents, japanese, emoji, non-ascii start, non-ascii end etc). Here are the rough results:

mb_check_encoding: 95ms average. Performance degrades way faster than preg_match and ctype as the strings get longer (1MB+).

mb_check_encoding($input, 'ASCII');

preg_match: 85ms average. Decently fast for 1MB+ strings (walks the string, so faster if there are non-ascii characters early in the string).

!preg_match('/[\\x80-\\xff]/', $input);

ctype_print: 83ms average. Decently fast for 1MB+ strings (walks the string, so faster if there are non-ascii characters early in the string). DO NOTE that this is not really an ascii check.

ctype_print($input);

while/ord: 500ms average. I'm still waiting for the 1MB+ strings test to finish.

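function is_ascii($input) {
    $num = 0;
    // Walk the string byte by byte until we run past the end.
    while( isset( $input[$num] ) ) {
        // A set high bit (0x80) means the byte is outside 7-bit ASCII.
        if( ord( $input[$num] ) & 0x80 ) {
            return false;
        }
        $num++;
    }
    return true;
}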
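
For anyone wanting to reproduce this kind of comparison, here is a rough sketch of a timing helper. It is not the harness used for the figures above: the name `time_check`, the sample inputs, and the iteration count are all assumptions, and it needs PHP 7.4+ (arrow functions, `hrtime`). Absolute numbers will vary by machine and PHP build.

```php
<?php
// Hypothetical micro-benchmark helper: total wall time in milliseconds
// for running $check over every input, $iterations times.
function time_check(callable $check, array $inputs, int $iterations = 10000): float
{
    $start = hrtime(true);                 // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        foreach ($inputs as $input) {
            $check($input);
        }
    }
    return (hrtime(true) - $start) / 1e6;  // nanoseconds to milliseconds
}

// Illustrative inputs: empty, short, long ASCII, accented, Japanese.
$inputs = ['', 'short', str_repeat('a', 1000), "caf\xc3\xa9", "\xe6\x97\xa5\xe6\x9c\xac"];

$checks = [
    'preg_match'  => fn(string $s): bool => !preg_match('/[\x80-\xff]/', $s),
    'ctype_print' => fn(string $s): bool => ctype_print($s),
];

foreach ($checks as $name => $check) {
    printf("%-12s %.1f ms\n", $name, time_check($check, $inputs));
}
```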
Bart
4

Try mb_check_encoding. For example:

mb_check_encoding($identifier, 'ASCII');
Dharman
Hans Kerkhof
2

Try: (Source)

function is_ascii( $string = '' ) {
    return ( bool ) ! preg_match( '/[\\x80-\\xff]+/' , $string );
}

Although all of the above answers are correct in general, these solutions may give wrong answers depending on the input. See the last section in this ASCII validation post.

Hamid Sarfraz
1

You could use:

mb_detect_encoding

but it may not be as precise as you want it to be.

fyr
-1

I suggest you look into utf8_encode or utf8_decode in the PHP manual:

http://www.php.net/manual/en/function.utf8-encode.php

Look at the examples further down that page; even if they are not exactly what you are looking for, they may point you in the right direction.

Ole Media