
Could somebody please show me a simple example of parsing some HTML using libxml2?

#import <libxml2/libxml/HTMLparser.h>

NSString *html = @"<ul>"
    "<li><input type=\"image\" name=\"input1\" value=\"string1value\" /></li>"
    "<li><input type=\"image\" name=\"input2\" value=\"string2value\" /></li>"
  "</ul>"
  "<span class=\"spantext\"><b>Hello World 1</b></span>"
  "<span class=\"spantext\"><b>Hello World 2</b></span>";

1) Say I want to parse the value of the input whose name = input2.

Should output "string2value".

2) Say I want to parse the inner contents of each span tag whose class = spantext.

Should output: "Hello World 1" and "Hello World 2".

outis
StuR
  • libxml is for xml parsing and for that you need to see TouchXML. – Ayaz Alavi Jun 11 '10 at 07:56
  • Even though I'm using HTMLparser.h? I'll take a look at TouchXML thanks. – StuR Jun 11 '10 at 08:37
  • @Ayaz: libxml2 supports HTML4 parsing. From the sparse documentation of TouchXML, it seems it doesn't, so it's not appropriate in this instance. – JeremyP Jun 11 '10 at 10:17
  • TouchXML has a CXMLDocumentTidyHTML option in its CXMLDocument.h file, which suggests this problem could also be solved with TouchXML. You could also look at KissXML, which is inspired by TouchXML. For a pure HTML parser, I just found this link: http://touchtank.wordpress.com/element-parser/ .. see if it fits your needs. – Ayaz Alavi Jun 11 '10 at 10:29
  • http://github.com/zootreeves/Objective-C-HMTL-Parser did what I wanted. Thanks very much for your help. – StuR Jun 11 '10 at 14:12

2 Answers


I used Ben Reeves' HTML Parser to achieve what I wanted:

NSError *error = nil;
NSString *html = 
    @"<ul>"
        "<li><input type='image' name='input1' value='string1value' /></li>"
        "<li><input type='image' name='input2' value='string2value' /></li>"
    "</ul>"
    "<span class='spantext'><b>Hello World 1</b></span>"
    "<span class='spantext'><b>Hello World 2</b></span>";
HTMLParser *parser = [[HTMLParser alloc] initWithString:html error:&error];

if (error) {
    NSLog(@"Error: %@", error);
    [parser release]; // avoid leaking the parser on the early return
    return;
}

HTMLNode *bodyNode = [parser body];

NSArray *inputNodes = [bodyNode findChildTags:@"input"];

for (HTMLNode *inputNode in inputNodes) {
    if ([[inputNode getAttributeNamed:@"name"] isEqualToString:@"input2"]) {
        NSLog(@"%@", [inputNode getAttributeNamed:@"value"]); //Answer to first question
    }
}

NSArray *spanNodes = [bodyNode findChildTags:@"span"];

for (HTMLNode *spanNode in spanNodes) {
    if ([[spanNode getAttributeNamed:@"class"] isEqualToString:@"spantext"]) {
        NSLog(@"%@", [spanNode allContents]); //Answer to second question
    }
}

[parser release];
StuR

As Vladimir said, for the second point it's important to replace rawContents with Contents. rawContents returns the node's complete raw HTML, markup included, i.e.:

<span class='spantext'><b>Hello World 1</b></span>
ElPiter