Purely to learn C++ and Qt, I'm writing a small Qt-based program that reads HTML files (up to several hundred) from a local directory, modifies them, and writes them to another local directory.
My first try used QWebPage and the HTML parsing functionality provided by QWebElement. However, I ran into severe memory leaks caused by QWebPage (very likely because I'm not using it correctly, but that is another topic and not part of this question).
So far I'm not using any GUI, and although I intend to add one later, this part of my program will never be part of the GUI; it will run somewhere in the background.
Thus I thought of replacing QWebPage with QTextBrowser, which seems more lightweight. However, I could not find functions in the Qt API similar to the parsing functions of QWebElement. So far my code relies on QWebElement::findFirst(), QWebElement::nextSibling(), and finally QWebElement::takeFromDocument().
So, is there a reasonably painless way of using (or adapting) QTextBrowser as an HTML parser? Maybe even a 'best practice'?
I do not need to evaluate any JavaScript, although the HTML pages very likely contain inline scripts. Neither do I need to apply CSS for styling, although it is heavily used in the pages in question. I just need to retrieve certain HTML blocks (such as table rows) based on their id or CSS class.
PS: I'm only willing to use existing C++ HTML parsing libraries if all feasible and reasonable attempts using pure Qt fail.
PPS: Just for the sake of seeing and knowing them, I'd also like to get to know unusual solutions. ;-)
Here is the part of my current code where I parse and remove certain parts of the HTML page using QWebElement. reportPage points to a QWebPage object.
// QWebPage has no document(); the DOM root is reached via mainFrame()->documentElement().
QWebElement table = reportPage->mainFrame()->documentElement().findFirst( "table[id=gadgettable]" );
table.findFirst( "tr[class=c2]" ).takeFromDocument();
table.findFirst( "tr" ).takeFromDocument();
table.findFirst( "td[id=gadgettable-left-td]" ).takeFromDocument();
table.findFirst( "td[id=gadgettable-right-td]" ).takeFromDocument();
// Each call removes the element that currently follows the first remaining row,
// so the two calls remove two successive siblings.
table.findFirst( "tr" ).nextSibling().takeFromDocument();
table.findFirst( "tr" ).nextSibling().takeFromDocument();