2

I have an NSScanner object that scans through HTML documents for paragraph tags. It seems like the scanner stops at the first result it finds, but I need all the results in an array.

How can my code be improved to go through an entire document?

- (NSArray *)getParagraphs:(NSString *) html 
{
    NSScanner *theScanner;
    NSString *text = nil;

    theScanner = [NSScanner scannerWithString: html];

    NSMutableArray*paragraphs = [[NSMutableArray alloc] init];

    // find start of tag
    [theScanner scanUpToString: @"<p>" intoString: NULL];
    if ([theScanner isAtEnd] == NO) {
        NSInteger newLoc = [theScanner scanLocation] + 10;
        [theScanner setScanLocation: newLoc];

        // find end of tag
        [theScanner scanUpToString: @"</p>" intoString: &text];

        [paragraphs addObject:text];
    }

    return text;
}
Pripyat

2 Answers

6

Do not use a scanner to parse HTML (and don't use regular expressions, either.... oh, the pain)*. The whole point of HTML is that it is a structured document designed to be traversed as a tree of nodes or objects. Pretty much the entire DOM [Document Object Model] based industry is built around this.

Just use an XML parser, as well structured HTML is really just XML anyway. NSXMLDocument (or -- if you need event driven -- NSXMLParser) will work grand.
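As a rough illustration of the event-driven route, here is a minimal sketch that collects the text of every `<p>` element with NSXMLParser. The class name `ParagraphCollector` and the function `ParagraphsFromXHTML` are hypothetical names made up for this example, and the code assumes ARC and well-formed (XML-compliant) markup; NSXMLParser will stop with an error on tag soup.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): gathers the text of each <p>.
@interface ParagraphCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableArray *paragraphs;
@property (nonatomic, strong) NSMutableString *current; // nil while outside a <p>
@end

@implementation ParagraphCollector

- (instancetype)init {
    if ((self = [super init])) {
        _paragraphs = [NSMutableArray array];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser
didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI
 qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict {
    // Start a fresh buffer whenever a paragraph opens.
    if ([elementName isEqualToString:@"p"]) {
        self.current = [NSMutableString string];
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // Messaging nil is a no-op, so text outside a <p> is ignored.
    [self.current appendString:string];
}

- (void)parser:(NSXMLParser *)parser
 didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI
 qualifiedName:(NSString *)qName {
    // Store the buffered text when the paragraph closes.
    if ([elementName isEqualToString:@"p"] && self.current != nil) {
        [self.paragraphs addObject:[self.current copy]];
        self.current = nil;
    }
}

@end

// Returns the text of all <p> elements, or nil if the markup could not be parsed as XML.
static NSArray *ParagraphsFromXHTML(NSString *xhtml) {
    NSData *data = [xhtml dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    ParagraphCollector *collector = [[ParagraphCollector alloc] init];
    parser.delegate = collector;
    return [parser parse] ? [collector.paragraphs copy] : nil;
}
```

Text inside nested inline elements (such as `<b>`) ends up in the enclosing paragraph's string, because the buffer stays active until the matching `</p>` arrives.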

Or, if you have to deal with malformed HTML (i.e. arbitrary server sewage), use a proper HTML parser.

This question/answer describes exactly that, with a solid example.
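For the malformed-HTML case, one option is libxml2's HTML parser, which can be called directly from Objective-C. The sketch below is only an illustration under some assumptions: the function name `ParagraphsFromHTML` is made up for this example, the input is treated as UTF-8, ARC is enabled, and the project links against libxml2 with its headers on the search path.

```objective-c
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: returns the text content of every <p> element in the input.
static NSArray *ParagraphsFromHTML(NSString *html) {
    NSMutableArray *paragraphs = [NSMutableArray array];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    if (data.length == 0) {
        return paragraphs;
    }

    // HTML_PARSE_RECOVER asks libxml2 to keep going on broken markup.
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return paragraphs;
    }

    // Select every <p> element anywhere in the tree.
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = NULL;
    if (context != NULL) {
        result = xmlXPathEvalExpression((const xmlChar *)"//p", context);
    }

    if (result != NULL && result->nodesetval != NULL) {
        for (int i = 0; i < result->nodesetval->nodeNr; i++) {
            // xmlNodeGetContent returns a copy that must be released with xmlFree.
            xmlChar *content = xmlNodeGetContent(result->nodesetval->nodeTab[i]);
            if (content != NULL) {
                NSString *text = [NSString stringWithUTF8String:(const char *)content];
                if (text != nil) {
                    [paragraphs addObject:text];
                }
                xmlFree(content);
            }
        }
    }

    if (result != NULL) xmlXPathFreeObject(result);
    if (context != NULL) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return paragraphs;
}
```

Because the query runs against the parsed tree rather than the raw text, attributes on the tag and inline markup inside the paragraph need no special handling.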

*Not to mention that parsing HTML is a "solved problem" in the industry. There is no need to roll a new one.

bbum
  • Which is why you use libxml2's HTML4 parser.... Anytime someone says "HTML is not XML", they typically are referring to the lack of enforced proper tag structure, which the libxml2 should take care of. – bbum Jun 12 '11 at 21:00
  • The differences between HTML and XHTML are more than compulsory end tags (or self-closing tags). So "well structured HTML is really just XML anyway" doesn't sound very accurate. HTML can be well structured and still not be XML. Anyway, I agree that it's better to use a HTML parser. – albertamg Jun 13 '11 at 07:15
2

Disclaimer: To parse HTML, it's better to use an HTML parser like libxml's HTML 4 parser, especially when dealing with arbitrary, possibly malformed HTML. Still, since the question asks how to improve existing code that uses NSScanner, I provide the following example. It will work in most cases, but there are some corner cases where it won't. For serious HTML parsing, use an HTML parser.


Iterate until the scanner has exhausted all characters:

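NSScanner* scanner = [NSScanner scannerWithString:html];
NSMutableArray *paragraphs = [[NSMutableArray alloc] init];
// advance to the first opening tag; "<p" (without ">") also matches tags with attributes
[scanner scanUpToString:@"<p" intoString:nil];
while (![scanner isAtEnd]) {
    // skip the rest of the opening tag, including any attributes
    [scanner scanUpToString:@">" intoString:nil];
    [scanner scanString:@">" intoString:nil];
    // capture everything up to the closing tag
    NSString * text = nil;
    [scanner scanUpToString:@"</p>" intoString:&text];
    if (text) { // if html contains empty paragraphs <p></p>, text could be nil
        [paragraphs addObject:text];
    }
    // advance to the next opening tag, or to the end of the document
    [scanner scanUpToString:@"<p" intoString:nil];
}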
...
[paragraphs release];
albertamg
  • That'll fail in any number of perfectly valid cases; `<p ...>...</p>` or `<p>This is a <p>paragraph in a paragraph</p>.</p>`, for example. – bbum Jun 12 '11 at 20:03
  • I edited the code to take into consideration element attributes as @bbum suggested. About the nested paragraphs, I don't think they are valid html. – albertamg Jun 12 '11 at 20:43
  • Yup -- you are right. `<p>f<p>d</p>c</p>` is not valid... but, when parsing arbitrary HTML, "valid" typically means "whatever browser du jour" does and structural validity is not really a consideration. – bbum Jun 12 '11 at 21:02
  • Also -- IIRC, a `

    – bbum Jun 12 '11 at 21:07
  • I edited the question to reflect that it would be better to use a HTML parser – albertamg Jun 13 '11 at 07:31