
I have a lot of images that were imported from an SQL dump with UTF-8 encoding. As a result, instead of "FF D8 FF E0" I see "C3 BF C3 98 C3 BF C3 A0" at the beginning of JPEG images.

I've tried iconv('utf-8', 'iso-8859-1', $data), but it does not convert the whole file (the UTF-8 data contains characters that cannot be converted to iso-8859-1).

How can I simply convert the UTF-8 data back to single-byte binary, regardless of encoding?

Giacomo1968
Epsiloncool
  • If the images were indeed treated as iso-8859-1 text and written to the database as utf-8 text, and you can't convert them back, then something's strange. They should be reversible - it doesn't matter that *all* characters in utf-8 aren't representable in iso-8859-1, since *only* characters from iso-8859-1 could have been found in the source images because they were *treated* as iso-8859-1. Which characters are giving you problems? Also, I hope it goes without saying that images shouldn't be treated as text, regardless of encoding. :) – bzlm Dec 02 '13 at 15:41
  • If I were you I would simply not store images encoded as UTF8. This solves all the problems here. – Artur Dec 02 '13 at 15:41
  • you need to know the encoding that was used when converted to utf-8 – njzk2 Dec 02 '13 at 15:43
  • @Artur unfortunately I have no image originals. – Epsiloncool Dec 02 '13 at 15:46
  • @Epsiloncool, can you put one of the images online for us to experiment on? From your example, it looks like the first two bytes at least were successfully and reversibly converted from iso-8859-1 or windows-1252 (or some other 8-bit encoding that includes ÿ and Ø) to utf-8. – bzlm Dec 02 '13 at 15:48
  • @bzlm Thank you. I've added a couple of images to my first message. Any help would be appreciated. – Epsiloncool Dec 02 '13 at 15:57
  • The initial encoding may be Spanish Latin (iso-8859-1), but I cannot convert to it. – Epsiloncool Dec 02 '13 at 16:05
  • @Epsiloncool: If the input data (image bytes) were converted to UTF-8 as if every byte value was treated as a Unicode code point, the operation should be completely reversible. There must have been some additional operation involved somewhere along the way. Show us the field definition where you keep the images. – Artur Dec 02 '13 at 17:20

1 Answer


The problem was that the same character can have more than one byte representation in UTF-8; the longer ones are called "non-shortest" (overlong) forms. Those characters can be converted mathematically, but iconv treats them as erroneous and does not convert them.
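To make "converted mathematically" concrete, here is a minimal sketch of the bit arithmetic for a single two-byte sequence. The function names `decode_two_byte` and `is_non_shortest_two_byte` are hypothetical helpers made up for this illustration, not part of the linked code.

```php
<?php
// Decode one two-byte UTF-8 sequence arithmetically, with no validity check.
// Layout: 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy
function decode_two_byte(string $seq): int
{
    $lead = ord($seq[0]);
    $cont = ord($seq[1]);
    return (($lead & 0x1F) << 6) | ($cont & 0x3F);
}

// A code point below 0x80 fits in a single byte, so encoding it in
// two bytes is a non-shortest form that strict decoders reject.
function is_non_shortest_two_byte(string $seq): bool
{
    return decode_two_byte($seq) < 0x80;
}

printf("%02X\n", decode_two_byte("\xC3\xBF")); // FF (regular encoding of U+00FF)
printf("%02X\n", decode_two_byte("\xC1\x81")); // 41 (non-shortest encoding of "A")
```

The arithmetic gives a sensible code point in both cases; the difference is only that a strict converter refuses the second sequence instead of decoding it.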

I've written a short function that converts UTF-8 text, whatever characters it contains, to an array of Unicode (UTF-16) code points. It then remaps the code points that do not fit in one byte to single-byte values using a simple table (for example, 0x20AC becomes 0x80). You can find the complete code and remapping table here: Converting UTF-8 with non-shortest characters to one-byte encoding
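For readers who cannot follow the link, this is a rough sketch of the approach described above, not the original code. It assumes the remapping table is the Windows-1252 layout for bytes 0x80-0x9F, which matches the 0x20AC to 0x80 example; `CP1252_REVERSE` and `utf8_to_bytes` are names invented here.

```php
<?php
// Assumed table: code points that Windows-1252 assigns to bytes 0x80-0x9F.
const CP1252_REVERSE = [
    0x20AC => 0x80, 0x201A => 0x82, 0x0192 => 0x83, 0x201E => 0x84,
    0x2026 => 0x85, 0x2020 => 0x86, 0x2021 => 0x87, 0x02C6 => 0x88,
    0x2030 => 0x89, 0x0160 => 0x8A, 0x2039 => 0x8B, 0x0152 => 0x8C,
    0x017D => 0x8E, 0x2018 => 0x91, 0x2019 => 0x92, 0x201C => 0x93,
    0x201D => 0x94, 0x2022 => 0x95, 0x2013 => 0x96, 0x2014 => 0x97,
    0x02DC => 0x98, 0x2122 => 0x99, 0x0161 => 0x9A, 0x203A => 0x9B,
    0x0153 => 0x9C, 0x017E => 0x9E, 0x0178 => 0x9F,
];

// Lenient decoder: accepts non-shortest forms, then emits one byte per code point.
function utf8_to_bytes(string $utf8): string
{
    $out = '';
    $n = strlen($utf8);
    for ($i = 0; $i < $n; $i += $len) {
        $lead = ord($utf8[$i]);
        if ($lead < 0x80) {
            $cp = $lead;        $len = 1;
        } elseif (($lead & 0xE0) === 0xC0) {
            $cp = $lead & 0x1F; $len = 2;
        } elseif (($lead & 0xF0) === 0xE0) {
            $cp = $lead & 0x0F; $len = 3;
        } elseif (($lead & 0xF8) === 0xF0) {
            $cp = $lead & 0x07; $len = 4;
        } else {
            throw new UnexpectedValueException("Invalid lead byte at offset $i");
        }
        if ($i + $len > $n) {
            throw new UnexpectedValueException("Truncated sequence at offset $i");
        }
        // Fold in the continuation bytes (10yyyyyy), six bits at a time.
        for ($k = 1; $k < $len; $k++) {
            $cont = ord($utf8[$i + $k]);
            if (($cont & 0xC0) !== 0x80) {
                throw new UnexpectedValueException("Bad continuation byte at offset " . ($i + $k));
            }
            $cp = ($cp << 6) | ($cont & 0x3F);
        }
        // Code points up to 0xFF map to themselves; larger ones go through the table.
        if ($cp > 0xFF) {
            if (!array_key_exists($cp, CP1252_REVERSE)) {
                throw new UnexpectedValueException(sprintf('No single-byte mapping for U+%04X', $cp));
            }
            $cp = CP1252_REVERSE[$cp];
        }
        $out .= chr($cp);
    }
    return $out;
}

echo bin2hex(utf8_to_bytes("\xC3\xBF\xC3\x98\xC3\xBF\xC3\xA0")), "\n"; // ffd8ffe0
```

The sketch throws on anything it cannot map rather than guessing, so a damaged image fails loudly instead of being silently corrupted further.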

Epsiloncool