Is this a valid UTF8 character in this xml file?

Question

I have received some XML from an upstream data source.

I'm not sure if these weird characters are valid UTF-8, or if the upstream source has messed things up, i.e. bad data in => bad data out.

I'm guessing the following is what was passed down:

Value in XML file  | Unicode Value | UTF-8 Value  | English Description
-------------------------------------------------------------------------------------------
&#xE2;&#x80;&#x99; | U+2019        | \xe2\x80\x99 | RIGHT SINGLE QUOTATION MARK
&#xE2;&#x80;&#xA2; | U+2022        | \xe2\x80\xa2 | BULLET
&amp;              | -not unicode- | --           | Ampersand, HTML encoded.

I feel like the \ at the start of the UTF-8 value is sorta... encoded but... done wrong?

Can someone please explain what I'm looking at, so I know how to correctly decode it? What's also frustrating is that I feel like this could be a mix of encodings, which will make things awful :(

Reference: http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string-literal

score 4 · Accepted Answer · answered Sep 05 '17 at 11:05

It's not a matter of UTF-8 in the XML you receive, because character escapes of the form &#xXX; encode characters (Unicode code points) directly, so there's no question of what the encoding is. [Actually, it could be, in that whatever is producing the XML may have been written by someone who doesn't understand how XML escapes are meant to work. After all, once something is buggy, there's no point assuming it does anything correctly until proven otherwise.]

It does look like something along the way has treated some perfectly good UTF-8 as if it was a different encoding, then decided to escape the results. Some of the characters you are getting as a result of this (U+0080 and U+0099) are characters that are allowed in XML but strongly discouraged. Others ('â' and '¢') are perfectly sensible characters (though produced in non-sensible ways), which makes the decision to escape them nearly as strange as whatever mistake led to their being there.
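To illustrate how this kind of output can arise, here is a minimal Python sketch that reproduces the values from the question. It assumes the wrong encoding was ISO Latin 1 and that every non-ASCII character was then escaped as a hex reference; the function name `mojibake_escape` is made up for this illustration, and the real upstream code may well do something different.

```python
def mojibake_escape(text: str) -> str:
    """Reproduce the suspected upstream bug (an assumption, not confirmed)."""
    # Encode correctly as UTF-8, then wrongly read those bytes as ISO Latin 1.
    # Each UTF-8 byte becomes one separate character.
    wrong = text.encode("utf-8").decode("latin-1")
    # Escape every non-ASCII character as a hex numeric character reference.
    return "".join(f"&#x{ord(c):X};" if ord(c) > 0x7F else c for c in wrong)


print(mojibake_escape("\u2019"))  # &#xE2;&#x80;&#x99;
print(mojibake_escape("\u2022"))  # &#xE2;&#x80;&#xA2;
```

Both results match the escaped sequences in the question, which is consistent with (though not proof of) the Latin 1 guess.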

Whatever the source of the mojibake, you're getting mojibake, so if you can complain or report a bug upstream, do so and have it fixed at source rather than trying to fix something that is broken.

Otherwise you're going to have to try to unescape the characters, encode them using whatever encoding the upstream code wrongly assumed (I'd guess ISO Latin 1, but there are other possibilities) and then decode the resulting bytes as UTF-8. There's no promise that that won't do just as much damage to a correct bit of the document as it undoes to the buggy bit though.
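The three steps can be sketched in Python as follows. This is only a sketch under the assumption that the wrong encoding was ISO Latin 1; the helper name `repair` and its `wrong_encoding` parameter are hypothetical. The references are unescaped with a plain regex rather than `html.unescape`, because the latter applies HTML5 rules and remaps references such as `&#x80;` and `&#x99;` to Windows-1252 characters instead of U+0080 and U+0099.

```python
import re

# Matches hex numeric character references such as &#xE2;
_HEX_REF = re.compile(r"&#x([0-9A-Fa-f]+);")


def repair(fragment: str, wrong_encoding: str = "latin-1") -> str:
    """Undo the suspected mojibake; wrong_encoding is a guess."""
    # Step 1: unescape each reference to the code point it names.
    unescaped = _HEX_REF.sub(lambda m: chr(int(m.group(1), 16)), fragment)
    # Step 2: encode with the wrongly assumed encoding to recover the bytes.
    raw = unescaped.encode(wrong_encoding)
    # Step 3: decode those bytes as the UTF-8 they originally were.
    return raw.decode("utf-8")


print(repair("&#xE2;&#x80;&#x99;") == "\u2019")  # True
print(repair("&#xE2;&#x80;&#xA2;") == "\u2022")  # True
```

Text that was correct to begin with is not safe here: a genuine 'é', for example, becomes the lone byte 0xE9, which is not valid UTF-8, so the final step raises `UnicodeDecodeError`. That is at least loud rather than silent, but it means the function should only be applied to fragments known to be affected. Named entities such as `&amp;` are left alone and can be handled by the XML parser as usual.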

OK - so it's fair to say that the characters in question *are* mojibake ? Because that was my gut feeling ... and because of that, i was going to stop .. and go upstream for them to fix stuff. — Pure.Krome, Sep 05 '17 at 13:34
Yup. It's an upstream problem, so an upstream solution would always be best. — Jon Hanna, Sep 05 '17 at 13:49
To be clear, the question itself indicates a misconception. Numeric character entity references encode Unicode codepoints, not UTF-8 code units. There is no such thing as a UTF-8 character. — Tom Blodget, Sep 06 '17 at 01:50
