Detect if string is Unicode or binary

Question

How do I determine if a string is Unicode text or contains any binary data?

Using ctype_print will only work if you 100% expect the string to only be ASCII; mine will contain Unicode.
preg_match('~[^\x20-\x7E\t\r\n]~', $str) > 0 only covers a limited range of Unicode.
strpos($string, "\0")===FALSE implies all binary data must have a NUL byte.
mb_detect_encoding detects strings as UTF-8 even if all characters are exclusively binary.
mb_check_encoding detects strings as UTF-8 even if all characters are exclusively binary.
strlen($string) != strlen(utf8_decode($string)) can only detect if a string is not ASCII.

One possible approach: detect if any characters have an ID that is beyond Unicode. However, I don't know how binary data works or whether that is applicable, nor could I find anything on returning a character's numeric assignment (e.g. ! is 0021).
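
On the numeric assignment: PHP's mbstring extension can return a character's code point. A minimal sketch, assuming PHP 7.2+ with mbstring and a non-empty, valid UTF-8 input; `codePointLabel` is a hypothetical helper name, not a built-in:

```php
<?php
// Hypothetical helper: format the first character's code point as U+XXXX.
// Assumes $char is non-empty, valid UTF-8 (validate first with mb_check_encoding).
function codePointLabel(string $char): string
{
    $codePoint = mb_ord($char, 'UTF-8'); // integer code point of the first character
    return sprintf('U+%04X', $codePoint);
}

echo codePointLabel('!'), "\n";                // U+0021
echo codePointLabel("\xf0\x9f\x98\x83"), "\n"; // U+1F603
```

Well-formed UTF-8 cannot encode anything above U+10FFFF, so a check for code points "beyond Unicode" reduces to a UTF-8 validity check, which arbitrary bytes can still pass by coincidence.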

Did you try https://www.php.net/manual/en/function.mb-detect-encoding.php? — Urmat Zhenaliev, Nov 24 '20 at 05:36
Or better https://www.php.net/manual/en/function.mb-check-encoding.php — Urmat Zhenaliev, Nov 24 '20 at 05:36
Does this answer your question? [Check unicode in PHP](https://stackoverflow.com/questions/1350758/check-unicode-in-php) — Urmat Zhenaliev, Nov 24 '20 at 05:38
@UrmatZhenaliev I got pulled away from my desk for two hours; thank you for the suggestions. I think a mix might work, I'm looking into what you've posted. I know offhand I've looked at at least one of those already. — John, Nov 24 '20 at 06:46
@UrmatZhenaliev Unfortunately all of those methods test positive for UTF-8 even if I'm using exclusively all binary data. — John, Nov 24 '20 at 07:07
Do you mean UTF-8 vs. some other 8-bit encoding like ISO-Latin1 or Windows-1252? If you've got access to the `file` command it can usually *guess*. — tadman, Nov 24 '20 at 07:20
If it validates as UTF-8 encoded, then it’s valid UTF-8. Whether it was *meant* to be that or just accidentally happens to validate as UTF-8 you cannot tell. You could try a text analysis library which might tell you whether it appears to be text that makes sense, but that’s still just a guess. You should *know* what you’re trying to deal with, not guess. — deceze, Nov 24 '20 at 07:22
@tadman No, I am *not* trying to detect what text encoding is being used, I'm trying to determine if the string is text/Unicode or binary. I clarify Unicode because all of the "accepted" answers are valid only for ASCII which is something like only 0.001% of Unicode. Additionally binary data can and does contain valid Unicode characters as the binary data itself. Too many people get an answer that works from a very limited perspective and accept it as some universal truth. — John, Nov 24 '20 at 07:25
@deceze This is for email attachments and there are literally ~1,000 *official* file types (https://www.iana.org/assignments/media-types/media-types.xml) and then there are the non-official ones. I can't foretell what will be attached or not. — John, Nov 24 '20 at 07:27
And what are you trying to do with these attachments that requires this distinction? — deceze, Nov 24 '20 at 07:28
You keep saying "binary". What does this mean? What *sort* of binary? I'd strongly suggest using the `file` tool as it has patterns to match a lot of things, but it's a complex piece of machinery that's not trivial to re-implement, especially in PHP. `is_unicode()` is not a sure thing, there's lots of binary data that *could* be construed as valid UTF-8. The reason people talk about limited answers is because there is no universal truth, there's only rough approximations under *extremely specific parameters*. — tadman, Nov 24 '20 at 07:37
If you're doing this in the real world you will get UTF-ish data, where it's almost but not quite UTF-8 due to encoding issues and byte mangling, or failed conversions. Not everything will be 100% clean. At best you can be 99.99% certain something is *probably* UTF-8. What tools like that do is examine the byte pattern to see if it follows the UTF-8 convention fairly closely, and if so, predicts UTF-8 as the encoding. — tadman, Nov 24 '20 at 07:39
This question doesn't make sense. You're asking how to know whether a given stream of bytes should be interpreted as Unicode or as "binary", which... is logically impossible. It **is** binary, regardless of whether it's also Unicode. Some binary sequences will yield intelligible Unicode documents and some will yield unintelligible Unicode documents, but you can't generalize this. You cannot look at an arbitrary stream of data and tell for certain whether it is Unicode, you need to rely on context. Regarding your email attachment example: Look at file extensions first, not file contents. — user229044, Nov 24 '20 at 14:50

score 4 · Answer 1 · answered Nov 24 '20 at 11:39

What counts as binary and what does not is always a matter of perspective:

"\x46\x61\x69\x6c" can be:
- the text Fail as per ASCII and UTF-8
- the text 䙡楬 as per UTF-16 BE
- the number 1180789100 as per 32bit Integer BE
- the timestamp 2028-10-06, 12:10:12 as per DOS datetime
- the dimensions 24902 x 27753 for two 16bit LE integers, interpreted as width and height
"\xf0\x9f\x98\x83" can be:
- the text 😃 as per UTF-8
- the text рЯШГ as per codepage 1283/10007/x-mac-cyrillic
- the number -8.9704769e-37 as per IEEE 754 Single 32bit LE
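
Some of the readings listed above can be reproduced directly in PHP. This is a rough sketch, assuming the mbstring extension and PHP 7.1.1+ (for the `g` format of `unpack()`); the variable names are illustrative only, and the DOS datetime and Mac Cyrillic readings are left out:

```php
<?php
// The same four bytes, read under different assumptions about what they mean.
$bytes = "\x46\x61\x69\x6c";

$asAscii   = $bytes;                                           // "Fail"
$asUtf16Be = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE'); // two CJK characters
$asUint32  = unpack('N', $bytes)[1];                           // 32-bit big-endian integer
[$width, $height] = array_values(unpack('v2', $bytes));        // two 16-bit little-endian integers

$emoji   = "\xf0\x9f\x98\x83";
$asFloat = unpack('g', $emoji)[1];                             // 32-bit little-endian float

printf("%s | %s | %d | %d x %d | %e\n", $asAscii, $asUtf16Be, $asUint32, $width, $height, $asFloat);
```

None of these calls fails or warns: each interpretation is equally valid as far as the bytes are concerned.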

As you can see, the same bytes can be both binary and text. What you want comes down to heuristics and pattern recognition, and neither can give you the one and only correct answer, only indications. Likewise, you can throw all your bytes at a text encoding detector to see which encodings match (for example mb_detect_encoding(); make sure to use strict mode), but at the end of the day it is only as robust as the input: if you only have, say, 5 bytes, they will most likely match at least one text encoding, while 500 bytes may or may not violate every known text encoding.
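
To see how many encodings a given byte string is compatible with, one option is to validate it against each candidate in turn. The sketch below swaps mb_detect_encoding() for a loop over mb_check_encoding(); `matchingEncodings` is a hypothetical helper and the candidate list is an assumption, not a recommendation:

```php
<?php
// Hypothetical helper: list every candidate encoding in which $bytes is well-formed.
// The candidate list is an assumption; extend it to whatever you expect to receive.
function matchingEncodings(string $bytes, array $candidates): array
{
    $matches = [];
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            $matches[] = $encoding;
        }
    }
    return $matches;
}

$candidates = ['ASCII', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'ISO-8859-1'];

print_r(matchingEncodings("\xf0\x9f\x98\x83", $candidates)); // short input: typically more than one match
print_r(matchingEncodings("\xc3\x28", $candidates));         // not well-formed UTF-8, so UTF-8 is excluded
```

A non-empty result says only that the bytes are well-formed in those encodings, not that they were written as text.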

Checking for "\x00" is not good either, as NUL bytes will occur at least in UTF-16 and UTF-32. When doing charset detection, spotting NUL bytes may indicate UTF-16, but can also lead to wrong results like "Bush hid the facts".
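
A short illustration of why the NUL check misfires, assuming the mbstring extension is available: plain ASCII-range text picks up NUL bytes as soon as it is stored as UTF-16 or UTF-32.

```php
<?php
// Assumes the mbstring extension.
$utf8  = 'Fail';
$utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8'); // "F\0a\0i\0l\0"
$utf32 = mb_convert_encoding($utf8, 'UTF-32LE', 'UTF-8'); // three NUL bytes per letter

var_dump(strpos($utf8, "\0") !== false);  // bool(false)
var_dump(strpos($utf16, "\0") !== false); // bool(true)
var_dump(strlen($utf16), strlen($utf32)); // int(8) int(16)
```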

Detecting file formats in general (in contrast to detecting text alone) is a bit easier when signatures are defined that help identify a format. For text, the only such signature is a byte order mark, which only a few encodings define and which is not mandatory either.
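
Checking for a byte order mark could look roughly like this. `detectBom` is a hypothetical helper covering only the UTF-8, UTF-16 and UTF-32 marks; a null result means no mark was found, not that the data is binary:

```php
<?php
// Hypothetical helper: return the encoding named by a leading byte order mark, or null.
// UTF-32 marks are tested first because the UTF-32LE mark begins with the UTF-16LE mark.
function detectBom(string $bytes): ?string
{
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

var_dump(detectBom("\xEF\xBB\xBF" . 'Fail')); // string(5) "UTF-8"
var_dump(detectBom('Fail'));                  // NULL
```

Even a match is only a hint: the bytes FF FE 00 00 could also be a UTF-16LE mark followed by a NUL character.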

Explained in detail with good examples. mb_detect_encoding() does not work reliably in strict mode either. I use preg_match ("//u", $string) to identify a valid UTF-8 string. — jspit, Nov 24 '20 at 16:45
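
The preg_match() check mentioned in the comment above can be wrapped as follows. This is a sketch, and `isValidUtf8` is a hypothetical helper name:

```php
<?php
// Hypothetical helper wrapping the check from the comment above.
// With the /u modifier, preg_match() returns false when the subject is not valid UTF-8.
function isValidUtf8(string $string): bool
{
    return preg_match('//u', $string) === 1;
}

var_dump(isValidUtf8("\xf0\x9f\x98\x83")); // bool(true)
var_dump(isValidUtf8("\xc3\x28"));         // bool(false)
```

A true result means only that the bytes are well-formed UTF-8; as the answer explains, it cannot tell whether they were meant as text.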
