Unexplained upgrade of a string to utf-8

Question

I have a web server in Perl with POE. Before the data hits the wire, the header and body are concatenated in POE::Filter::HTTPD->put. For some bizarre reason, some of the headers are being promoted to utf-8, which means the binary body is getting corrupted.

The problem is that the join in headers_as_strings() is upgrading some headers to UTF-8 even though it shouldn't. For example, if I add the following code, only the last line produces a warning. So a join of 3 non-utf8 strings is producing a UTF-8 string, but not for all headers. The solution is to call utf8::downgrade on $ret[-1], but I want to know why this is happening.

my $vnl = _process_newline( $value, $endl );
warn "$$: '$name' is utf8" if utf8::is_utf8( $name );
warn "$$: '$sep' is utf8" if utf8::is_utf8( $sep );
warn "$$: '$vnl' is utf8" if utf8::is_utf8( $vnl );
push @ret, join $sep, $name, $vnl;
# only this last line produces a warning
warn "$$: the join has utf8 " if utf8::is_utf8( $ret[-1] );

`is_utf8` only tells you whether Perl's internal flag is set for the string. Don't use it in your code. Author of the code should always know whether the string they operate with contains bytes or codepoints. — choroba, Sep 01 '20 at 19:48
Re "*some of the headers are being promoted to utf-8*", Perl is free to use whichever internal storage format it desires. This doesn't change the string. If you use `eq` to compare the original string and an upgraded/downgraded version of it, it will return true. — ikegami, Sep 01 '20 at 20:23
Re "*The solution is to utf8::downgrade on $ret[-1]*", When code treats a string differently based on its internal storage format (the value returned by `is_utf8`), we say it suffers from The Unicode Bug. `utf8::downgrade` and `utf8::upgrade` are used as a workaround for such bugs. — ikegami, Sep 01 '20 at 20:30
Re "*I want to know why this is happening*", Please provide the output of `use Devel::Peek; Dump($_) for $name, $sep, $vnl, $ret[-1];` (after the `push`). — ikegami, Sep 01 '20 at 20:30
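
The comments above claim that upgrading changes only the internal storage format and not the string itself. Here is a minimal sketch of that, using a made-up sample string rather than the asker's actual headers. The variable names are illustrative only. `Devel::Peek::Dump` writes to STDERR, and the upgraded copy should show `UTF8` among its FLAGS.

```perl
use strict;
use warnings;
use Devel::Peek;    # core module; Dump() prints internals to STDERR

# Hypothetical sample: four characters, one of them above 0x7F.
my $original = "caf\xE9";
my $upgraded = $original;
utf8::upgrade($upgraded);    # changes the storage format, not the characters

my $flag_original = utf8::is_utf8($original) ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $same_string   = $original eq $upgraded   ? 1 : 0;

print "original flagged: $flag_original\n";    # 0
print "upgraded flagged: $flag_upgraded\n";    # 1
print "equal under eq:   $same_string\n";      # 1

# Inspect both scalars the way the comment suggests.
Dump($_) for $original, $upgraded;
```

Code that respects Perl's string semantics cannot tell the two scalars apart. Only code that looks at the internal buffer can.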

score 2 · Answer 1 · answered Sep 01 '20 at 20:57

The short answer is that Perl will upgrade a string to utf-8 without warning. I was using a MIME::Type object that I thought was a string. MIME::Types opens its DB with open DB, '<:encoding(utf8)'.

But the real WTF is that POE::Driver::SysRW->flush has use bytes; before syswrite() and that's when the data gets jumbled.
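
To illustrate why `use bytes;` matters here, this is a minimal sketch with a made-up binary body, not the actual POE::Driver::SysRW code. It assumes the body was upgraded by an earlier concatenation, which `utf8::upgrade` simulates. Under `use bytes;`, operations see the internal buffer, so characters in the range 0x80 to 0xFF appear as two bytes each.

```perl
use strict;
use warnings;

# Hypothetical binary payload: 5 bytes, two of them above 0x7F.
my $body    = "\x89PNG\xFF";
my $message = $body;
utf8::upgrade($message);    # stand-in for an upgrade caused by concatenation

my $chars = length $message;    # string length is unchanged

my $internal;
{
    use bytes;                      # lexically scoped: exposes the internal buffer
    $internal = length $message;    # \x89 and \xFF now take two bytes each
}

utf8::downgrade($message);      # the workaround from the question
my $after;
{
    use bytes;
    $after = length $message;
}

print "characters:           $chars\n";       # 5
print "bytes while upgraded: $internal\n";    # 7
print "bytes after downgrade: $after\n";      # 5
```

Writing the 7-byte internal form to a socket would corrupt the payload, which matches the symptom described in the question.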

Leolo


aye, one should never use `use bytes;` – ikegami Sep 02 '20 at 04:28
Re "*The short answer is that Perl will upgrade a string to utf-8 without warning.*", True, but I can't think of a situation in which Perl will upgrade the format for no reason. After all, the downgraded format is more efficient. So Perl only upgrades when necessary. – ikegami Sep 02 '20 at 09:19
The longer story is that I was getting the content-type from MIME::Types, which reads its DB with :encoding(utf-8). While POE::Filter::HTTPD was running utf8::downgrade() on that value, the final concatenation of the headers + body was still upgrading it all to UTF-8. – Leolo Sep 03 '20 at 20:41
You're conflating things. `:encoding(UTF-8)` has nothing to do with the internal storage format. `utf8::downgrade` only modifies the internal storage format. The only question you should have to answer is: Does the sub require decoded text (aka a string of Unicode Code Points) or encoded text (bytes)? You should not have to care about the internal storage format. If you have to, it means the code is buggy. So either you passed decoded text when you were supposed to pass encoded text (which has NOTHING to do with `is_utf8`, `downgrade` or `upgrade`), or there's a bug in the sub. – ikegami Sep 03 '20 at 21:51
Or both. It sounds like you passed decoded text, and that the sub uses `use bytes;`. I think HTTP headers are purely ASCII. If what you produced is ASCII-only, then you effectively did provide encoded text, and therefore it's purely a bug in the module. It should never matter that `is_utf8` is true. – ikegami Sep 03 '20 at 21:58
But yeah, if `is_utf8` is true, that means the string uses 32- or 64-bit characters internally. (Remember, a string is really just an array of numbers, and each of these numbers is called a "character", which has nothing to do with letters.) If false, it means the string uses 8-bit characters. Obviously, it makes more sense to use the 32/64-bit format for the concatenation of both, because the 8-bit format wouldn't be able to store the characters above 255 which might be present in one of the inputs. But again, these are INTERNAL formats one shouldn't have to care about! – ikegami Sep 03 '20 at 22:01
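
The behaviour described in the last comment can be sketched as follows. The header value and body are made-up stand-ins, and `utf8::upgrade` simulates a value that arrived in the upgraded format (for example from a file handle read through an encoding layer). When one operand of a join or concatenation is upgraded, the result is upgraded too, yet every character keeps its value.

```perl
use strict;
use warnings;

# Hypothetical header value: pure ASCII, but stored in the upgraded format.
my $value = "image/png";
utf8::upgrade($value);

# A join of two downgraded strings and one upgraded string.
my $header = join ": ", "Content-Type", $value;

# Hypothetical binary body in the downgraded format.
my $body    = "\x89PNG\xFF";
my $message = $header . "\r\n\r\n" . $body;

my $header_flagged  = utf8::is_utf8($header)  ? 1 : 0;
my $message_flagged = utf8::is_utf8($message) ? 1 : 0;
my $body_intact     = substr($message, -length $body) eq $body ? 1 : 0;

print "header flagged:  $header_flagged\n";     # 1
print "message flagged: $message_flagged\n";    # 1
print "body intact:     $body_intact\n";        # 1
```

The body is still intact at the string level. Corruption appears only when something downstream reads the internal buffer directly, as `use bytes;` does.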
