3

I'm working on an app which aggregates some feeds from the internet and reformats the content, so I'm looking for a way to parse some HTML. Given that XML and HTML are very similar in structure, I was thinking "maybe I should just use an NSXMLParser". I'm already using it to parse my RSS feeds and I've become comfortable with it, but I'm running into a problem.

The parser will not recognize <p> as an element. It has no problem extracting elements like <title> or <img>, but it doesn't like <p>. Has anyone tried doing this, and if so, do you have any suggestions or workarounds for this issue? I think NSXMLParser is good for what I'm doing and I would like to use it, but obviously, if I can't get the text in <p> elements, it's completely useless to me.

Any suggestions are welcome, even ones suggesting a different method entirely. I've looked into some third-party libraries for doing this, but from what I've read they all have some bugs, and I would much prefer to use something provided by Apple.

Mark Amery
evanmcdonnal

3 Answers

4

There's absolutely nothing special about "p" as the name of an element. It is hard to be sure because you haven't provided an example of the HTML you are parsing, but the problem is most likely caused by HTML that is not well-formed XML. In other words, NSXMLParser would work on XHTML, but not necessarily on plain old HTML.

The "p" element is frequently found in HTML without the matching closing tag, which is not valid XML. My guess is that you would have to convert the HTML to XHTML before trying to parse it with an NSXMLParser.
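To illustrate the point that "p" is not special to a strict XML parser, here is a minimal sketch in Python rather than Objective-C, using the standard library's `xml.etree.ElementTree` as a stand-in for NSXMLParser. The markup string and the helper name `paragraph_texts` are made up for the example. As long as every tag is closed, the `p` elements come out like any other element.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed, XHTML-style markup: every <p> has a matching </p>.
WELL_FORMED = "<body><p>first</p><p>second</p></body>"


def paragraph_texts(markup):
    """Parse markup with a strict XML parser and return the text of each <p>."""
    root = ET.fromstring(markup)
    return [p.text for p in root.iter("p")]


print(paragraph_texts(WELL_FORMED))
```

The same strict parser raises an error as soon as a closing tag is missing, which is why cleaning the HTML up into XHTML first is one way forward.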

Tim Dean
  • Would any unclosed tag cause the NSXMLParser to fail, or would it only fail on the tag which is not closed? I haven't inspected all of the HTML but I think all the "p" tags are closed. Here's a link to it; view-source:http://www.americansongwriter.com/2011/05/behind-the-song-the-gambler/ – evanmcdonnal Jan 15 '12 at 21:52
  • Any unclosed tag would cause any XML parser to fail in some way, including NSXMLParser. What you have is clearly not valid XML. I pulled down the source and did an XML validation and got loads of errors. For example, it's got unclosed "div" and "input" tags on lines 300 and 301. It's also got invalid angle bracket characters in XML attributes (e.g. on line 357). The list of errors is quite long. This HTML is not valid XML, so NSXMLParser won't work on it directly. It will need to be cleaned up first. – Tim Dean Jan 15 '12 at 22:24
  • Alright. I don't want to extend this question too much further, but would you still recommend turning it into XHTML, or do you think it would be better to go with another parsing method altogether? Also, if you think converting to XHTML would be best, could you point me towards some reference on how to do that? – evanmcdonnal Jan 15 '12 at 22:40
  • I haven't tried, so I can't vouch for any specific conversion tool, but you could try http://www.chilkatsoft.com/html-objc.asp or use TouchXML with its Tidy HTML feature (see http://stackoverflow.com/questions/4258333/iphone-html-parsing-using-touchxml-and-tidy) – Tim Dean Jan 15 '12 at 22:46
1

I recommend you use my DTHTMLParser, which is modeled after NSXMLParser and uses libxml2 to parse HTML. You generally cannot rely on HTML being well-formed and parseable as XML.

libxml2 has an HTML mode in which it tolerates unclosed tags and the other idiosyncrasies of real-world HTML.
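The idea of a tolerant, event-driven HTML parser can be sketched without libxml2. The example below is not DTHTMLParser and does not use libxml2; it swaps in Python's standard `html.parser.HTMLParser`, which also reports start tags, end tags, and text through callbacks and does not fail on unclosed tags. The class name `ParagraphCollector`, the helper `extract_paragraphs`, and the rule that a new `<p>` implicitly ends the previous one are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of <p> elements, even when </p> is missing."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # text chunks of the open <p>, or None

    def _flush(self):
        # Store the paragraph collected so far, if there is one.
        if self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes the previous one
            self._current = []

    def handle_endtag(self, tag):
        # Assumption for this sketch: these end tags also end an open <p>.
        if tag in ("p", "div", "body"):
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # keep a paragraph still open at end of input


def extract_paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


print(extract_paragraphs("<body>\n    <p>123\n    <p>abc\n    <p>789\n</body>"))
```

The callback shape is similar in spirit to the NSXMLParser delegate methods, so code written against a strict parser can often be adapted to a lenient one with little restructuring.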

HTML parsing explained:

DTHTMLParser documentation:

Source, part of DTFoundation:

Cocoanetics
1

HTML is not necessarily well-formed XML, and that's the trouble when you parse it as XML.

Take the following example:

<body>
    <p>123
    <p>abc
    <p>789
</body>

If you view this chunk of HTML in a browser, it renders just as you would expect. But if you parse it as XML, the parser will report an error, because those p tags are not closed.
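A quick way to see the failure is to feed that exact snippet to a strict XML parser. This sketch uses Python's `xml.etree.ElementTree` in place of NSXMLParser; the helper name `try_strict_parse` is made up for the example, and the exact wording of the error message depends on the underlying parser.

```python
import xml.etree.ElementTree as ET

# The snippet from above, with unclosed <p> tags.
SNIPPET = """<body>
    <p>123
    <p>abc
    <p>789
</body>"""


def try_strict_parse(markup):
    """Return None if markup is well-formed XML, else the parser's error text."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as err:
        return str(err)


error = try_strict_parse(SNIPPET)
print("parse error:", error)

# With the closing tags added, the same parser accepts the markup.
print(try_strict_parse("<body><p>123</p><p>abc</p><p>789</p></body>"))
```

The parser only notices the problem when it reaches `</body>` while a `p` element is still open, so the reported position points at the end of the snippet rather than at the first unclosed tag.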

勿绮语