Simple libxml2 HTML parsing example, using Objective-C, Xcode, and HTMLparser.h

Question

Can somebody please show me a simple example of parsing some HTML using libxml2?

#import <libxml2/libxml/HTMLparser.h>

NSString *html = @"<ul>"
    "<li><input type=\"image\" name=\"input1\" value=\"string1value\" /></li>"
    "<li><input type=\"image\" name=\"input2\" value=\"string2value\" /></li>"
  "</ul>"
  "<span class=\"spantext\"><b>Hello World 1</b></span>"
  "<span class=\"spantext\"><b>Hello World 2</b></span>";

1) Say I want to get the value attribute of the input whose name is "input2".

This should output "string2value".

2) Say I want to get the inner text of each span tag whose class is "spantext".

This should output "Hello World 1" and "Hello World 2".

libxml is for XML parsing, and for that you should look at TouchXML. — Ayaz Alavi, Jun 11 '10 at 07:56
Even though I'm using HTMLparser.h? I'll take a look at TouchXML thanks. — StuR, Jun 11 '10 at 08:37
@Ayaz: libxml2 supports HTML4 parsing. From the sparse documentation of TouchXML, it seems it doesn't, so it's not appropriate in this instance. — JeremyP, Jun 11 '10 at 10:17
TouchXML declares CXMLDocumentTidyHTML in its CXMLDocument.h file, which suggests this problem could be solved with TouchXML too. You could also look at KissXML, which is inspired by TouchXML. For a pure HTML parser, I just found this link: http://touchtank.wordpress.com/element-parser/ (see if it fits your needs). — Ayaz Alavi, Jun 11 '10 at 10:29
http://github.com/zootreeves/Objective-C-HMTL-Parser Did what I wanted, thanks v much for your help. — StuR, Jun 11 '10 at 14:12

StuR · Accepted Answer · 2012-10-24T10:46:45.777

I used Ben Reeves' HTML Parser to achieve what I wanted:

NSError *error = nil;
NSString *html = 
    @"<ul>"
        "<li><input type='image' name='input1' value='string1value' /></li>"
        "<li><input type='image' name='input2' value='string2value' /></li>"
    "</ul>"
    "<span class='spantext'><b>Hello World 1</b></span>"
    "<span class='spantext'><b>Hello World 2</b></span>";
HTMLParser *parser = [[HTMLParser alloc] initWithString:html error:&error];

if (error) {
    NSLog(@"Error: %@", error);
    [parser release]; // avoid leaking the parser on the early return (manual reference counting)
    return;
}

HTMLNode *bodyNode = [parser body];

NSArray *inputNodes = [bodyNode findChildTags:@"input"];

for (HTMLNode *inputNode in inputNodes) {
    if ([[inputNode getAttributeNamed:@"name"] isEqualToString:@"input2"]) {
        NSLog(@"%@", [inputNode getAttributeNamed:@"value"]); //Answer to first question
    }
}

NSArray *spanNodes = [bodyNode findChildTags:@"span"];

for (HTMLNode *spanNode in spanNodes) {
    if ([[spanNode getAttributeNamed:@"class"] isEqualToString:@"spantext"]) {
        NSLog(@"%@", [spanNode allContents]); //Answer to second question
    }
}

[parser release];

I know this is old, but I'm pretty sure he wants "allContents" and not "rawContents" — clarky, Oct 23 '12 at 15:29
@StuR does his library work for iPhone development on iOS 6 as well? — Dejell, Dec 29 '12 at 22:09
@Odelya I should think so, although I haven't tested it. You may need to set the -fno-objc-arc compiler flag. — StuR, Jan 01 '13 at 17:08

score 1 · Answer 2 · answered Feb 26 '12 at 21:42

As Vladimir said, for the second point it's important to replace rawContents with Contents. rawContents will print the node's complete raw markup, tags included, i.e.:

<span class='spantext'><b>Hello World 1</b></span>

ElPiter

