XMLParser has problems reading UTF8 characters

Question

I am trying to parse the following XML:

<CntyNtry>
    <EngNm>Virgin Islands (British)</EngNm>
    <FrNm>Vierges britanniques (les Îles)</FrNm>
    <A2Cd>VG</A2Cd>
    <A3Cd>VGB</A3Cd>
    <CtryNbr>92</CtryNbr>
</CntyNtry>

As you can see, there are some accents on some of the letters.

I tried to parse the XML with the following code:

func parser(_ parser: XMLParser, didStartElement elementName: String, namespaceURI: String?, qualifiedName qName: String?, attributes attributeDict: [String : String] = [:]) {
    if elementName == Element.getXMLRecordElementTagName() {
        stack.push(Element.newObject())
        record.removeAll(keepingCapacity: false)
    } else if Element.getXMLRecordAttributeElementTagName().contains(elementName) {
        stackKey.push(Element.getNSManagedObjectAttributeName(fromXMLRecordElementTagName: elementName))
    }
}

func parser(_ parser: XMLParser, foundCharacters string: String) {
    let key = stackKey.pop()
    if key != nil {
        record[key!] = string
    }
}

func parser(_ parser: XMLParser, didEndElement elementName: String, namespaceURI: String?, qualifiedName qName: String?) {
    if elementName == Element.getXMLRecordElementTagName() {
        Element.add(object: record)
        record.removeAll(keepingCapacity: false)
    }
}

If anybody needs the rest of the code, please let me know, but basically record[key!] = string should be able to read the UTF8 characters.

When I run my unit test, I get the following error: the string is cut off just before the accented character. I have tried other data with accents and get the same error.

XCTAssertEqual failed: ("Optional("Vierges britanniques (les")") is not equal to ("Optional("Vierges britanniques (les Îles)")") -

Is my unit test code wrong, or is there a problem in the parser?

func testImportDataCnty() {
    Country.delete()
    XCTAssertTrue(Country.count() == 0)
    XCTAssertTrue(importerCnty.importData())
    XCTAssertTrue(Country.count() > 0)

    let kor = Country.get(id: ["VGB"])?[0] as! Country
    XCTAssertEqual(kor.englishName, country2["englishName"] as? String)
    XCTAssertEqual(kor.frenchName, country2["frenchName"] as? String)
    //Test failed on the above row.
    XCTAssertEqual(kor.alpha2Code, country2["alpha2Code"] as? String)
    XCTAssertEqual(kor.alpha3Code, country2["alpha3Code"] as? String)
    XCTAssertEqual(kor.countryNumber, Int16(country2["countryNumber"] as! Int))
}

I'm not sure we've got enough to diagnose this, as you're referencing some types that are non-standard and undefined. It looks like some form of UTF8 to/from C string problem, or something like that, but there's not enough here to diagnose. — Rob, Jan 02 '17 at 06:51
Likely unrelated, your `foundCharacters` doesn't look quite right, because it sometimes can take more than one call to `foundCharacters` to return the whole string. I don't think that's the problem here, but it does seem like a more subtle problem in this code... — Rob, Jan 02 '17 at 06:52
Thank you for the comment @Rob. The rest of the code is mostly about saving data to CoreData and reading it back. Does that have anything to do with this problem? — Ham Dong Kyun, Jan 02 '17 at 09:27
All I know is that `foundCharacters` generally handles UTF8 strings fine. So, now that you know that the UTF8 string is getting corrupted along the way, start examining it at every step of the process (starting at `foundCharacters`) and see where the problem occurs. But there's not enough here for us to diagnose it. — Rob, Jan 02 '17 at 16:32
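
Rob's suggestion to examine the string at each step, starting at `foundCharacters`, could be tried with a small standalone delegate like the one below. This is only a minimal sketch, not the asker's code: `ChunkLogger`, `sampleXML` and `chunkParser` are hypothetical names, and it assumes the `<FrNm>` element from the question is parsed on its own. It records each chunk the parser delivers, so you can see whether the text arrives in one piece or in several.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives here on non-Apple platforms
#endif

// Hypothetical diagnostic delegate: it stores every chunk that XMLParser
// passes to parser(_:foundCharacters:) without interpreting it.
final class ChunkLogger: NSObject, XMLParserDelegate {
    private(set) var chunks: [String] = []

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        chunks.append(string)
    }
}

let sampleXML = "<FrNm>Vierges britanniques (les Îles)</FrNm>"
let logger = ChunkLogger()
let chunkParser = XMLParser(data: Data(sampleXML.utf8))
chunkParser.delegate = logger
let parsedOK = chunkParser.parse()

// Print each chunk on its own line; how many chunks there are is up to
// the parser, so the code must not rely on a particular count.
for (index, chunk) in logger.chunks.enumerated() {
    print("chunk \(index): \"\(chunk)\"")
}
print("parsed: \(parsedOK), joined: \"\(logger.chunks.joined())\"")
```

If more than one chunk is printed, any delegate that keeps only a single call's `string` will end up with truncated text.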

score 1 · Answer 1 · answered Jan 02 '17 at 17:24

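You can store special or non-ASCII characters in the XML as escaped character references. As an example, when I needed to write XML for an ampersand I did the following:

<name>Jones &amp; Jones</name>

XML itself predefines only five named entities (&amp;, &lt;, &gt;, &quot; and &apos;); HTML names such as &Icirc; are not defined unless a DTD declares them, so in your case a numeric character reference is the safer form:

<FrNm>Vierges britanniques (les &#206;les)</FrNm>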

See this HTML encoding table.

IntelliData

Thank you for the idea. It helped. But it also seems that func parser(_ parser: XMLParser, foundCharacters string: String) is called multiple times within a tag if there are special characters (an idea suggested by @Rob). – Ham Dong Kyun Jan 04 '17 at 23:02

score 1 · Accepted Answer · answered Jan 04 '17 at 23:08

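I solved the issue by changing my code as below. It seems that XMLParser can call parser(_:foundCharacters:) several times for the text of a single element when the text contains a special character, so I needed to append all the pieces instead of keeping only one. I also switched from pop() to peek() so that the key stays on the stack for the later pieces.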

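func parser(_ parser: XMLParser, foundCharacters string: String) {
    // The text of one element may arrive in several calls, so append each
    // piece to what has already been collected for the current key.
    // peek() leaves the key on the stack for the next piece.
    if let key = stackKey.peek() {
        record[key] = (record[key] ?? "") + string
    }
}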
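
The same append-everything idea can also be written with a buffer that is reset when an element starts and committed when it ends, which keeps the whitespace between elements out of the stored values. The following is a minimal, self-contained sketch under stated assumptions, not the code from the question: `RecordCollector`, `recordXML` and `recordParser` are hypothetical names, the XML tag names are used directly as dictionary keys, and the `stack`, `stackKey` and `Element` helpers from the question are left out because their definitions are not shown.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives here on non-Apple platforms
#endif

// Hypothetical delegate: collects the text of each child element of a
// <CntyNtry> record, keyed by the element's tag name.
final class RecordCollector: NSObject, XMLParserDelegate {
    private(set) var record: [String: String] = [:]
    private var buffer = ""

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        // A new element begins: discard anything collected so far,
        // including whitespace between sibling elements.
        buffer = ""
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // May be called more than once per element, so always append.
        buffer += string
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        // The element is complete: store its full text, except for the
        // record wrapper itself, which holds only child elements.
        if elementName != "CntyNtry" {
            record[elementName] = buffer
        }
        buffer = ""
    }
}

let recordXML = """
<CntyNtry>
    <EngNm>Virgin Islands (British)</EngNm>
    <FrNm>Vierges britanniques (les Îles)</FrNm>
</CntyNtry>
"""

let collector = RecordCollector()
let recordParser = XMLParser(data: Data(recordXML.utf8))
recordParser.delegate = collector
if recordParser.parse() {
    print(collector.record["FrNm"] ?? "missing")
}
```

Committing in `didEndElement` means the value is written once per element, no matter how many times `foundCharacters` was called in between.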
