
Purely to learn C++ and Qt, I'm writing a small Qt-based program that reads HTML files (up to several hundred) from a local directory, modifies them, and writes them to another local directory.

My first attempt used QWebPage and the HTML parsing functionality provided by QWebElement. However, I ran into severe memory leaks with QWebPage (very likely because I'm not using it correctly, but that is another topic and not part of this question).

At the moment I'm not using any GUI, and although I intend to add one later, this part of my program will never be part of the GUI; it will run somewhere in the background.
Thus I thought of replacing QWebPage with QTextBrowser, which seems more lightweight. However, I could not find functions in the Qt API similar to the parsing functions of QWebElement. So far my code relies on QWebElement::findFirst(), QWebElement::nextSibling() and finally QWebElement::takeFromDocument().

So, is there a reasonably painless way of using QTextBrowser as an HTML parser? Maybe even a 'best practice'?
I do not need to evaluate any JavaScript, though it is very likely inlined in the HTML pages. Nor do I need CSS for styling, though it is heavily used in the HTML pages in question. I just need to retrieve certain HTML blocks (such as table rows) based on their id or CSS class.

PS: I'm only willing to use existing C++ HTML parsing libraries if all feasible and reasonable attempts using pure Qt fail.

PPS: Just for the sake of seeing and knowing them, I'd also like to get to know unconventional solutions. ;-)


Here is the part of my current code where I parse and remove certain parts of the HTML page using QWebElement. reportPage is a QWebPage object.

reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "tr[class=c2]" ).takeFromDocument();
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "tr" ).takeFromDocument();
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "td[id=gadgettable-left-td]" ).takeFromDocument();
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "td[id=gadgettable-right-td]" ).takeFromDocument();
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "tr" ).nextSibling().takeFromDocument();
reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" ).findFirst( "tr" ).nextSibling().takeFromDocument();
Torbjörn

1 Answer


QTextBrowser isn't designed for the kind of editing you're proposing. However, based on your description, the QDomDocument / QDomElement classes may work for you, depending on whether your input documents are sufficiently XML-compatible to be accepted and written out again by the DOM classes. (In particular, this approach might lose the original formatting of the elements.)

Also the core DOM code lacks advanced query support - you need to either manually search the DOM for id attributes, or use the more advanced XPath / XQuery support.

James Turner
  • I just tried using QDomDocument. It's really lightweight. But as the HTML files I'm processing are far from being valid XHTML, a considerable part of each file is ignored. I don't know whether XPath/XQuery can handle this. Otherwise, I guess I'll have to fall back to regex. – Torbjörn Dec 22 '11 at 21:40
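
The regex fallback mentioned in the comment could look roughly like the sketch below. It uses only the C++ standard library (std::regex) rather than any Qt class, and the helper name removeFirstRowWithClass is hypothetical, introduced here purely for illustration. It assumes the targeted row contains no nested table rows and that the class attribute holds exactly one class name; regular expressions are not a real HTML parser, so treat this as a narrow workaround for known, regular input and not as a general solution.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Qt or any library): removes the first
// <tr> element whose class attribute is exactly `cssClass`.
// Assumptions: `cssClass` consists only of letters, digits, '-' or '_'
// (it is pasted into the pattern unescaped), and the row contains no
// nested <tr>. Matching is case-insensitive, including the class value.
std::string removeFirstRowWithClass(const std::string& html,
                                    const std::string& cssClass)
{
    // Accept class="x", class='x' or the unquoted form class=x.
    const std::string value =
        "(?:\"" + cssClass + "\"|'" + cssClass + "'|" + cssClass + "(?=[\\s>]))";

    const std::regex row(
        "<tr\\b[^>]*\\bclass\\s*=\\s*" + value + "[^>]*>"  // opening tag
        "[\\s\\S]*?"                                       // row content, lazy
        "</tr\\s*>",                                       // first closing tag
        std::regex::ECMAScript | std::regex::icase);

    // format_first_only: replace only the first match, similar in spirit
    // to findFirst(...).takeFromDocument() in the question.
    return std::regex_replace(html, row, "",
                              std::regex_constants::format_first_only);
}
```

Calling removeFirstRowWithClass(html, "c2") on a file's contents would drop the first row with class c2 and leave the rest of the markup byte-for-byte untouched, which avoids the reformatting that a DOM round trip may introduce. For very large files std::regex can be slow, so it may be worth measuring before processing several hundred documents this way.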