I need to get an XML file from some CRM software.

The XML file is encoded in UTF-8, but it contains some "strange" characters, and I can't parse it with SimpleXML because of them.

For example:

<ROW ART_LIB="CAT NxA1 2008"  />

The "xA1" character is present. What is it, and how do I convert it to the correct character?

The expected result after parsing is:

<ROW ART_LIB="CAT N° 2008"  />

Currently, I parse the XML file like this:

$fichier = utf8_encode(file_get_contents($inputfileName));
$xmlInput = simplexml_load_string($fichier);

How can I fix it?


Thanks to Jason Coco's help, I fixed the problem like this:

// Map MacRoman bytes to their closest ISO-8859-1 equivalents.
// Both strtr() tables must have the same length (80 bytes each);
// curly quotes and the en dash become plain ASCII quotes and a hyphen.
function mac_roman_to_iso($string)
{
    return strtr($string,
        "\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa1\xa4\xa6\xa7\xa8\xab\xac\xae\xaf\xb4\xbb\xbc\xbe\xbf\xc0\xc1\xc2\xc7\xc8\xca\xcb\xcc\xd6\xd8\xdb\xe1\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf1\xf2\xf3\xf4\xf8\xfc\xd2\xd3\xd4\xd5\xd0",
        "\xc4\xc5\xc7\xc9\xd1\xd6\xdc\xe1\xe0\xe2\xe4\xe3\xe5\xe7\xe9\xe8\xea\xeb\xed\xec\xee\xef\xf1\xf3\xf2\xf4\xf6\xf5\xfa\xf9\xfb\xfc\xb0\xa7\xb6\xdf\xae\xb4\xa8\xc6\xd8\xa5\xaa\xba\xe6\xf8\xbf\xa1\xac\xab\xbb\xa0\xc0\xc3\xf7\xff\xa4\xb7\xc2\xca\xc1\xcb\xc8\xcd\xce\xcf\xcc\xd3\xd4\xd2\xda\xdb\xd9\xaf\xb8\x22\x22\x27\x27-");
}

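// Read the raw MacRoman bytes and map them to ISO-8859-1,
// then convert to UTF-8 so SimpleXML can parse the document.
$fichier = mac_roman_to_iso(file_get_contents($inputfileName));
$xmlInput = simplexml_load_string(utf8_encode($fichier));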

Afterwards, I convert the values from ISO-8859-1 to UTF-8 with iconv().

Peter Mortensen
bahamut100
  • Are you 100% sure the remote file's encoding is UTF-8? What happens if you treat it as ISO-8859-1, does it look better? If the remote file is serving incorrectly encoded data, the best way to go is to try to get them to fix it (or override the encoding if possible) – Pekka Aug 19 '11 at 08:40
  • Why do you `utf8_encode` it again if you are sure it's already `UTF-8`-encoded? Maybe `$fichier = utf8_decode(file_get_contents($inputfileName));` will do the trick? – J0HN Aug 19 '11 at 08:41
  • Yes, it's UTF-8. When I convert it in Notepad++, I get this: – bahamut100 Aug 19 '11 at 08:42
  • ¡ is the character for xA1 in e.g. ANSI, latin1, ... Are there any other non-ASCII characters (outside 0-127) in the document that do not cause an error with simplexml_load_string()? – VolkerK Aug 19 '11 at 08:47
  • I agree with @Pekka... the xA1 doesn't really make any sense and isn't valid UTF-8 at all. The latin-1 encoding for it is 0xB0 which is 0xC2 0xB0 in UTF-8. – Jason Coco Aug 19 '11 at 08:47
  • So your encoding is not UTF-8. The encoding you have where that symbol is at 0xA1 is MacRoman. You need to treat the XML as MacRoman encoded *not* UTF-8 encoded. – Jason Coco Aug 19 '11 at 08:56
  • possible duplicate of [How to load XML with PHP when it fails with Input is not proper UTF-8 error?](http://stackoverflow.com/questions/1354263/how-to-load-xml-with-php-when-it-fails-with-input-is-not-proper-utf-8-error) – kenorb Mar 19 '15 at 23:03

2 Answers

The problem is not with UTF-8. The problem is that your XML file is not UTF-8 encoded; it is MacRoman encoded. Treat it as a MacRoman-encoded file and it should work fine.
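
As a minimal sketch of what "treat it as MacRoman" could look like in PHP, the bytes can be converted straight to UTF-8 with iconv() before parsing, with no intermediate ISO-8859-1 step. The helper name mac_roman_to_utf8 is hypothetical, and this assumes the iconv implementation behind your PHP build knows the charset name "MACINTOSH" (both glibc and GNU libiconv list it, but other builds may not).

```php
<?php
// Hypothetical helper: convert a MacRoman byte string to UTF-8.
// Assumes the underlying iconv library recognises the "MACINTOSH" charset.
function mac_roman_to_utf8(string $bytes): string
{
    $utf8 = iconv('MACINTOSH', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('iconv could not convert from MACINTOSH');
    }
    return $utf8;
}

// Sample row from the question, with the raw 0xA1 byte written explicitly.
$raw = "<ROW ART_LIB=\"CAT N\xA1 2008\" />";

// After conversion the input is well-formed UTF-8, so SimpleXML can parse it.
$xml = simplexml_load_string(mac_roman_to_utf8($raw));
echo $xml['ART_LIB'], "\n";
```

For a real file, the same helper would be applied to the result of file_get_contents() before calling simplexml_load_string().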

Jason Coco

Ideally, I think you should never have to use utf8_encode() or utf8_decode().

The same encoding should be declared at every level of your application.

Did you check the default encoding of your CRM, database, PHP files, and browser?
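
One way to check what the CRM actually sends, rather than what it claims, is to test whether the raw bytes are well-formed UTF-8 before parsing. This is only a sketch: the helper name is_valid_utf8 is hypothetical, and it assumes the mbstring extension is enabled.

```php
<?php
// Hypothetical helper: report whether a byte string is well-formed UTF-8.
// Relies on mb_check_encoding() from the mbstring extension.
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// A lone 0xA1 byte, as in the question's sample, is not valid UTF-8.
var_dump(is_valid_utf8("CAT N\xA1 2008"));

// The degree sign encoded as UTF-8 (0xC2 0xB0) is valid.
var_dump(is_valid_utf8("CAT N\xC2\xB0 2008"));
```

If the check fails on the downloaded file, the file is in some other encoding, whatever the XML declaration or the editor reports.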

Pierre de LESPINAY