Convert utf-8 back to one-byte binary in PHP

Question

I have a lot of images that were imported from an SQL dump with utf-8 encoding. As a result, instead of "FF D8 FF E0" I see "C3 BF C3 98 C3 BF C3 A0" at the beginning of the JPEG images.

I've tried iconv('utf-8', 'iso-8859-1', $data), but it does not convert the whole file (there are characters in the utf-8 data that cannot be converted to iso-8859-1).

How can I simply convert utf-8 back to one-byte binary, regardless of encoding?

If the images were indeed treated as iso-8859-1 text and written to the database as utf-8 text, and you can't convert them back, then something's strange. They should be reversible - it doesn't matter that *all* characters in utf-8 aren't representable in iso-8859-1, since *only* characters from iso-8859-1 could have been found in the source images because they were *treated* as iso-8859-1. Which characters are giving you problems? Also, I hope it goes without saying that images shouldn't be treated as text, regardless of encoding. :) — bzlm, Dec 02 '13 at 15:41
If I were you I would simply not store images encoded as UTF8. This solves all the problems here. — Artur, Dec 02 '13 at 15:41
you need to know the encoding that was used when converted to utf-8 — njzk2, Dec 02 '13 at 15:43
@Epsiloncool, can you put one of the images online for us to experiment on? From your example, it looks like the first two bytes at least were successfully and reversibly converted from iso-8859-1 or windows-1252 (or some other 8-bit encoding that includes ÿ and Ø) to utf-8. — bzlm, Dec 02 '13 at 15:48
@bzlm Thank you. I've added a couple of images to my first message. Any help would be appreciated. — Epsiloncool, Dec 02 '13 at 15:57
The initial encoding may be Spanish Latin (iso-8859-1), but I cannot convert to it. — Epsiloncool, Dec 02 '13 at 16:05
@Epsiloncool: If the input data (image bytes) were converted to UTF-8 with each successive byte value treated as a Unicode code point, the operation should be completely reversible. There must have been some additional operation involved somewhere along the way. Show us the field definition where you keep the images. — Artur, Dec 02 '13 at 17:20
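
The reversibility argument made in the comments can be checked with a minimal sketch. It assumes the import step treated every image byte as an ISO-8859-1 character, which the thread does not confirm; the function names latin1ToUtf8 and utf8ToLatin1 are hypothetical and written in plain PHP so that no extension is required.

```php
<?php
// Encode each byte as if it were an ISO-8859-1 character
// (code point == byte value). Hypothetical helper name.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= $bytes[$i];                // ASCII stays one byte
        } else {
            $out .= chr(0xC0 | ($b >> 6))      // lead byte: 0xC2 or 0xC3
                  . chr(0x80 | ($b & 0x3F));   // continuation byte
        }
    }
    return $out;
}

// Reverse the step above. Only handles code points U+0000..U+00FF
// and does not validate the continuation byte.
function utf8ToLatin1(string $utf8): string
{
    $out = '';
    $len = strlen($utf8);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($utf8[$i]);
        if ($b < 0x80) {
            $out .= $utf8[$i];
            continue;
        }
        if (($b !== 0xC2 && $b !== 0xC3) || $i + 1 >= $len) {
            throw new UnexpectedValueException("Not an ISO-8859-1 character at offset $i");
        }
        $out .= chr((($b & 0x03) << 6) | (ord($utf8[++$i]) & 0x3F));
    }
    return $out;
}

// The JPEG header from the question, encoded the same way.
echo strtoupper(bin2hex(latin1ToUtf8("\xFF\xD8\xFF\xE0"))), "\n"; // C3BFC398C3BFC3A0
```

Under that assumption every one of the 256 byte values survives the round trip, so a file that fails to convert back must contain something other than plain ISO-8859-1 code points.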

score 0 · Accepted Answer · answered Dec 09 '13 at 21:33

The problem was that UTF-8 allows several representations of the same character, the so-called "non-shortest" forms. Such characters can be converted mathematically, but iconv treats them as erroneous and does not convert them.
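
To illustrate what a "non-shortest" form looks like, here is a small sketch. The helper name decodeTwoByteSequence is hypothetical, and PCRE's /u mode stands in for a strict validator; the byte sequences are chosen for illustration and are not taken from the affected images.

```php
<?php
// Decode a two-byte UTF-8 sequence "mathematically",
// ignoring the shortest-form rule. Hypothetical helper name.
function decodeTwoByteSequence(string $seq): int
{
    $lead = ord($seq[0]);
    $cont = ord($seq[1]);
    // 110xxxxx 10yyyyyy -> xxxxxyyyyyy
    return (($lead & 0x1F) << 6) | ($cont & 0x3F);
}

$shortest = "\xC3\xBF"; // U+00FF in its shortest form
$overlong = "\xC1\xBF"; // U+007F padded to two bytes (non-shortest form)

printf("%02X\n", decodeTwoByteSequence($shortest)); // FF
printf("%02X\n", decodeTwoByteSequence($overlong)); // 7F

// A strict validator accepts the first sequence and rejects the second.
var_dump(preg_match('//u', $shortest) === 1); // bool(true)
var_dump(preg_match('//u', $overlong) === 1); // bool(false)
```

Both sequences yield a usable value when the bits are simply extracted, which is why a lenient decoder can recover data that a strict converter refuses to touch.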

I wrote a short function that converts any UTF-8 text into an array of Unicode code points, and then remaps the values that do not fit into one byte to single-byte values using a simple table (for example, 0x20AC is the same as 0x80, etc.). You can find the complete code and the remapping table here: Converting UTF-8 with non-shortest characters to one-byte encoding
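
The approach described above could be sketched roughly as follows. This is not the code from the linked post: the function and constant names are hypothetical, and the table assumes that the remapping follows the Windows-1252 assignments for bytes 0x80 to 0x9F, which the 0x20AC to 0x80 example suggests but does not prove.

```php
<?php
// Assumption: code points that Windows-1252 assigns to bytes 0x80-0x9F.
const CP1252_TO_BYTE = [
    0x20AC => 0x80, 0x201A => 0x82, 0x0192 => 0x83, 0x201E => 0x84,
    0x2026 => 0x85, 0x2020 => 0x86, 0x2021 => 0x87, 0x02C6 => 0x88,
    0x2030 => 0x89, 0x0160 => 0x8A, 0x2039 => 0x8B, 0x0152 => 0x8C,
    0x017D => 0x8E, 0x2018 => 0x91, 0x2019 => 0x92, 0x201C => 0x93,
    0x201D => 0x94, 0x2022 => 0x95, 0x2013 => 0x96, 0x2014 => 0x97,
    0x02DC => 0x98, 0x2122 => 0x99, 0x0161 => 0x9A, 0x203A => 0x9B,
    0x0153 => 0x9C, 0x017E => 0x9E, 0x0178 => 0x9F,
];

// Leniently decode UTF-8 into code points; non-shortest forms are accepted.
function utf8ToCodePoints(string $s): array
{
    $out = [];
    $len = strlen($s);
    for ($i = 0; $i < $len; $i += $extra + 1) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $cp = $b;
            $extra = 0;
        } elseif (($b & 0xE0) === 0xC0) {
            $cp = $b & 0x1F;
            $extra = 1;
        } elseif (($b & 0xF0) === 0xE0) {
            $cp = $b & 0x0F;
            $extra = 2;
        } elseif (($b & 0xF8) === 0xF0) {
            $cp = $b & 0x07;
            $extra = 3;
        } else {
            throw new InvalidArgumentException("Unexpected lead byte at offset $i");
        }
        if ($i + $extra >= $len) {
            throw new InvalidArgumentException("Truncated sequence at offset $i");
        }
        for ($k = 1; $k <= $extra; $k++) {
            $c = ord($s[$i + $k]);
            if (($c & 0xC0) !== 0x80) {
                throw new InvalidArgumentException("Bad continuation byte at offset " . ($i + $k));
            }
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        $out[] = $cp;
    }
    return $out;
}

// Map each code point to one byte: directly if it fits, otherwise via the table.
function utf8ToSingleByte(string $utf8): string
{
    $bytes = '';
    foreach (utf8ToCodePoints($utf8) as $cp) {
        $byte = $cp <= 0xFF ? $cp : (CP1252_TO_BYTE[$cp] ?? null);
        if ($byte === null) {
            throw new UnexpectedValueException(sprintf('No single-byte mapping for U+%04X', $cp));
        }
        $bytes .= chr($byte);
    }
    return $bytes;
}

// The corrupted JPEG header from the question.
echo strtoupper(bin2hex(utf8ToSingleByte("\xC3\xBF\xC3\x98\xC3\xBF\xC3\xA0"))), "\n"; // FFD8FFE0
```

Throwing on unmapped code points, rather than silently substituting a placeholder, makes it obvious when a file contains something the table does not cover, which matters for binary data where a single wrong byte corrupts the image.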
