
I'm trying to parse some HTML with C++ to extract all URLs from it (the URLs can be inside the href and src attributes).

I tried to use WebKit to do the heavy work for me, but when I load HTML into a frame, the generated document is all wrong. (If I let WebKit fetch the page from the web, the generated document is just fine, but WebKit also downloads all images, styles, and scripts, and I don't want that.)

Here is what I tried to do:

frame->setHtml(HTML);
QWebElement document = frame->documentElement();
QWebElementCollection links = document.findAll("a"); // Doesn't find all links
QWebElementCollection imgs = document.findAll("img"); // Doesn't find all images
QWebElementCollection scripts = document.findAll("script"); // Doesn't find all scripts
qDebug() << document.toInnerXml(); // Prints a completely messed-up document with several missing elements

What am I doing wrong? Is there an easy way to parse HTML with Qt? (Or some other lightweight library)

Raphael
  • 1. What "generated document"? 2. What do you mean by "all wrong"? 3. What is the expected behavior? 4. What is the actual behavior? – Billy ONeal May 22 '11 at 05:48
  • @Billy ONeal - When I load the frame with HTML the document structure inside the frame is missing several elements. (this does not happen if I load the page from the web using page->load(url)). – Raphael May 22 '11 at 05:52
  • @Billy ONeal - When I print the loaded document I can see that it has just some of the elements of the original HTML. If you put this code in a simple program and compile it, you'll see what I'm talking about. – Raphael May 22 '11 at 05:55

1 Answer


You can always use XPath expressions to make your parsing life easier; take a look at this, for instance.

Or you can do something like this:

QWebView* view = new QWebView(parent);
view->load(QUrl("http://www.your_site.com"));
// load() is asynchronous: run the query once loadFinished(bool) has been emitted
QWebElementCollection elements = view->page()->mainFrame()->findAllElements("a");
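
If the only goal is to collect the href and src values, and WebKit's loading behavior gets in the way, a plain pattern scan over the HTML string might be enough. The following is only a rough sketch using std::regex, not a real HTML parser: the function name extractUrls is made up for illustration, it only handles quoted attribute values preceded by whitespace, and it will also pick up matches inside comments or script text.

```cpp
#include <regex>
#include <set>
#include <string>

// Hypothetical helper: collects the values of quoted href/src attributes
// found in a raw HTML string. It does not build a DOM, so malformed tags do
// not break it, but unquoted values (href=foo) are not picked up.
std::set<std::string> extractUrls(const std::string& html) {
    // Whitespace, then href or src (case-insensitive), then '=' and a value
    // in double or single quotes. Requiring leading whitespace skips
    // attributes such as data-src.
    static const std::regex attr(
        R"re(\s(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'))re",
        std::regex::icase);

    std::set<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), attr), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Only one of the two capture groups takes part in any given match.
        const std::string value = m[1].matched ? m[1].str() : m[2].str();
        if (!value.empty()) {
            urls.insert(value);
        }
    }
    return urls;
}
```

Calling extractUrls(html) on the page source returns the distinct attribute values as written in the page. Relative URLs are not resolved, so they would still need to be combined with the page's base URL afterwards.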
snoofkin
  • This only works if the HTML was loaded from the web. If I load the HTML manually, it breaks on the malformed tags that are present on 90% of websites. – Raphael May 22 '11 at 14:06