I have this code in an iOS Playground (Swift 3, Xcode 8.2.1):

import UIKit
import PlaygroundSupport

PlaygroundPage.current.needsIndefiniteExecution = true

class ParserDelegate: NSObject, XMLParserDelegate {

    @objc func parser(_ parser: XMLParser, foundCharacters string: String) {
        print("found string:", string)
    }

    func parser(_ parser: XMLParser, parseErrorOccurred parseError: Error) {
        print("error:", parseError)
    }

    func parserDidEndDocument(_ parser: XMLParser) {
        PlaygroundPage.current.finishExecution()
    }

}

let string = "<xml>straße</xml>"
let parser = XMLParser(data: string.data(using: .utf8)!)
let delegate = ParserDelegate()
parser.delegate = delegate
parser.parse()

// prints this:
// found string: stra
// found string: ße

Why does XMLParser split straße into stra and ße, instead of parsing it all as one string? Is there an easy way around this, other than to concatenate all strings found by parser(_:foundCharacters:) until I get a call to parser(_:didEndElement:namespaceURI:qualifiedName:)?

Zev Eisenberg
  • "Why?" has several meanings. Why can it do that? Because it can: the interface is clearly [documented](https://developer.apple.com/reference/foundation/xmlparserdelegate/1412539-parser) to work that way. Why does it do that? I don't think anyone outside of the NSXMLParser team can say for sure, but I'd guess that it reads swaths of ASCII bytes quickly (easy to do, and very common), and when it hits a byte with the MSB set, it has to slow down to do full UTF-8 parsing. – Ssswift Mar 25 '17 at 03:32
  • @Ssswift nice theory. Thanks for explaining, and thanks for the documentation link! – Zev Eisenberg Mar 25 '17 at 05:37

1 Answer

It is not your business to care how the parser breaks up a run of text. It is your business to implement parser(_:foundCharacters:) in such a way as to accumulate the text no matter how many times it is called until didEndElement arrives. A typical implementation will look like this:

func parser(_ parser: XMLParser, foundCharacters string: String) {
    // Append each chunk; the parser may deliver one run of text in several calls.
    self.text += string
}

...where self.text is a String property, reset in didStartElement and read in didEndElement.
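
For completeness, here is a minimal sketch of how the three delegate methods might fit together. The class name AccumulatingDelegate, the values array, and the `<word>` sample document are hypothetical names chosen for illustration, not anything XMLParser requires. It assumes flat elements with no nesting and no mixed content.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux; Apple platforms skip it
#endif

// Hypothetical delegate: collects the complete text of each element.
final class AccumulatingDelegate: NSObject, XMLParserDelegate {
    private var text = ""                   // buffer for the element being read
    private(set) var values: [String] = []  // finished element texts, in document order

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        text = ""  // start a fresh buffer for every element
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        text += string  // may be called several times for one run of text
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        values.append(text)  // the element is complete, so the buffer is whole
        text = ""
    }
}

let data = "<word>straße</word>".data(using: .utf8)!
let parser = XMLParser(data: data)
let delegate = AccumulatingDelegate()  // keep a strong reference; the parser does not retain it
parser.delegate = delegate
parser.parse()
print(delegate.values)
```

However the parser chunks the text, values should end up holding the single string straße. Because the buffer is cleared on every start tag, nested elements or mixed content would need a stack of buffers instead of this single property.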

> Is there an easy way around this

That's a very silly way to look at it. It isn't something you need a "way around". There's a right way to implement foundCharacters. Do it and get on with life.

matt
  • Thanks. By "easy way," I meant that I was wondering if there was some configuration of `XMLParser` that I was overlooking, that would handle this for me, but having implemented the workaround that you describe, I realize that it's quite straightforward. – Zev Eisenberg Mar 25 '17 at 05:37