I am loading an entire HTML page and want to get all content between specific tags. For this I'm doing:

articleXpathQueryString = @"//article/div[@class='entry breadtext']";
articleNodes = [articleParser searchWithXPathQuery:articleXpathQueryString];
item.content = [self recursiveHTMLIterator:articleNodes content:@""];

And then I have a recursive function which attempts to sum up the content from all child nodes as well as their HTML tags:

-(NSString*) recursiveHTMLIterator:(NSArray*)elementArray content:(NSString*)content {
for(TFHppleElement *element in elementArray) {
    if(![element hasChildren]) {
        //The element has no children
    } else {
        //The element has children
        NSString *tmpStr = [[element firstChild] content];

        if(tmpStr != nil) {
            NSString *css = [element tagName];
            content = [content stringByAppendingString:[self createOpenTag:css]];
            content = [content stringByAppendingString:tmpStr];
            content = [content stringByAppendingString:[self createCloseTag:css]];
        }

        NSString *missingStr = [[element firstTextChild] content];
        if(![missingStr isEqualToString:tmpStr]) {
            if(missingStr != nil) {
                NSString *css= [element tagName];
                content = [content stringByAppendingString:[self createOpenTag:css]];
                content = [content stringByAppendingString:missingStr];
                content = [content stringByAppendingString:[self createCloseTag:css]];
            }
        }

        content = [self recursiveHTMLIterator:element.children content:content];
    }
}
return content;
}

However, even though the result is somewhat satisfactory, it doesn't pick up img tags, and it gets the order wrong when the HTML has the following format:

<p>
<strong>-</strong>
This text is not parsed because it skips it after it acquires <strong>-</strong>, this is why I have the second if-statement which catches up "missing strings", but they are inserted in the wrong order
</p>

So my question is: should I keep trying to get the recursive method to parse properly, or is there an easier way to acquire the desired HTML (which I then use within a web view)? What I'm looking for is all the content within

<article> THIS </article>.

In other words, I would like to do something like this with TFHpple (though this code does not work):

articleXpathQueryString = @"//article/div[@class='entry breadtext']";
articleNodes = [articleParser searchWithXPathQuery:articleXpathQueryString];
item.content = [articleParser allContentAsString];    //I simply want everything in articleParser in a string format
Oberheim

1 Answer

Ok I finally figured this out... I hope this helps if anyone is as stupid as me:

All that needs to be done is to load the URL into a web view and then run a simple JavaScript query as follows (in webViewDidFinishLoad):

NSString *bread_text = [webView stringByEvaluatingJavaScriptFromString:@"document.getElementsByClassName('entry breadtext')[0].innerHTML"];

This gets all the content within an element with a known class. Now I need to figure out how to load it without displaying the web view first, but this seems a lot easier than iterating through an XML structure :)
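
For context, here is a minimal sketch of where that query could live. The ArticleLoader class and its completion block are hypothetical names made up for illustration. The sketch also assumes that a UIWebView still loads a request when it is never added to a view hierarchy, which is worth verifying on the iOS version in use:

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper; the class name and completion block are assumptions.
@interface ArticleLoader : NSObject <UIWebViewDelegate>
@property (nonatomic, strong) UIWebView *webView;               // kept alive here, never shown
@property (nonatomic, copy) void (^completion)(NSString *html); // receives the extracted HTML
- (void)loadArticleAtURL:(NSURL *)url;
@end

@implementation ArticleLoader

- (void)loadArticleAtURL:(NSURL *)url {
    // Create the web view off screen; it is not added to any view hierarchy.
    self.webView = [[UIWebView alloc] initWithFrame:CGRectZero];
    self.webView.delegate = self;
    [self.webView loadRequest:[NSURLRequest requestWithURL:url]];
}

- (void)webViewDidFinishLoad:(UIWebView *)webView {
    // Same query as above: innerHTML of the first element carrying both classes.
    NSString *js = @"document.getElementsByClassName('entry breadtext')[0].innerHTML";
    NSString *breadText = [webView stringByEvaluatingJavaScriptFromString:js];
    if (self.completion) {
        self.completion(breadText);
    }
}

@end
```

The loader object has to be kept alive by its owner until the callback fires, because the web view's delegate is not retained. Also note that webViewDidFinishLoad may be called more than once for pages with frames, so the completion block may need to guard against repeated calls.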

Oberheim