Parsing HTML with C++ (using Qt preferably)

Question

I'm trying to parse some HTML with C++ to extract all URLs from it (the URLs can appear in href and src attributes).

I tried to use WebKit to do the heavy lifting, but for some reason, when I load a frame with HTML, the generated document is all wrong. If I let WebKit fetch the page from the web, the generated document is fine, but WebKit then also downloads all the images, styles, and scripts, which I don't want.

Here is what I tried to do:

frame->setHtml(HTML);
QWebElement document = frame->documentElement();
QWebElementCollection links = document.findAll("a"); // Doesn't find all links
QWebElementCollection imgs = document.findAll("img"); // Doesn't find all images
QWebElementCollection scripts = document.findAll("script"); // Doesn't find all scripts
qDebug() << document.toInnerXml(); // Prints a completely messed-up document with several missing elements

What am I doing wrong? Is there an easy way to parse HTML with Qt? (Or some other lightweight library)

1. What "generated document"? 2. What do you mean by "all wrong"? 3. What is the expected behavior? 4. What is the actual behavior? – Billy ONeal, May 22 '11 at 05:48
@Billy ONeal - When I load the frame with HTML, the document structure inside the frame is missing several elements. (This does not happen if I load the page from the web using page->load(url).) – Raphael, May 22 '11 at 05:52
@Billy ONeal - When I print the loaded document, I can see that it contains only some of the elements of the original HTML. If you put this code in a simple program and compile it, you'll see what I'm talking about. – Raphael, May 22 '11 at 05:55

score 2 · Accepted Answer · answered May 22 '11 at 08:18

You can always use XPath expressions to make your parsing life easier; take a look at this, for instance.

Or you can do something like this:

QWebView* view = new QWebView(parent);
view->load(QUrl("http://www.your_site.com"));
// load() is asynchronous: query the frame only after loadFinished(bool) has fired
QWebElementCollection elements = view->page()->mainFrame()->findAllElements("a");

snoofkin


This only works if the HTML was loaded from the web. If I load the HTML manually it will break on the malformed tags that are present on 90% of the websites. – Raphael May 22 '11 at 14:06
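
For the malformed-markup case raised in the comment above, one parser-free fallback is to scan the raw text for href and src attributes directly. The following is only a minimal sketch using the C++ standard library instead of Qt; the function name extractUrls is hypothetical, and the pattern is an assumption about what typical attributes look like, not a real HTML parser. It will also pick up matches inside comments or script bodies, and it does not resolve relative URLs.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every href and src attribute
// found in `html`, in document order. It is a plain pattern scan, so
// unclosed or badly nested tags do not affect the result.
std::vector<std::string> extractUrls(const std::string& html)
{
    // Matches href=... or src=... (case-insensitive) where the value is
    // double-quoted, single-quoted, or unquoted.
    static const std::regex attr(
        R"RX(\b(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))RX",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), attr), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one of the three capture groups takes part in each match.
        for (std::size_t group = 1; group <= 3; ++group) {
            if (m[group].matched) {
                urls.push_back(m[group].str());
                break;
            }
        }
    }
    return urls;
}
```

Calling extractUrls on the HTML string that would otherwise be passed to setHtml returns the attribute values as they appear in the source, so nothing is downloaded along the way.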
