
I am trying to get some data from a web page in Qt. Since QtWebKit is unmaintained, I would like to use QXmlStreamReader, but I get error messages for some web pages.

For example: XML Parse Error "Opening and ending tag mismatch." at http://www.google.com

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.de/?gfe_rd=cr&amp;ei=toP_WMrVKoHKXuvxnsAO">here</A>.
</BODY></HTML>

Before the error occurs, I get the elements HTML, HEAD, meta and TITLE.

Other error messages on valid HTML pages:

  • XML Parse Error "Expected '-' or 'DOCTYPE', but got '[a-zA-Z]'."
  • XML Parse Error "Entity 'raquo' not declared."

Here is my code:

webpage = new QXmlStreamReader(data);

//emit got_webpage(&QString(data));

QStringList test;

while (!webpage->atEnd() && !webpage->hasError())
{
    QXmlStreamReader::TokenType token = webpage->readNext();

    if (token == QXmlStreamReader::StartDocument)
        continue;

    if (token == QXmlStreamReader::StartElement)
    {
        test << webpage->name().toString();
        /*if (webpage->name() == "H1")
        {
            emit got_webpage(webpage)
        }*/
    }
}

// join() returns a temporary, so store it before passing its address.
QString names = test.join("\n");
emit got_webpage(&names);

if (webpage->hasError())
{
    // TODO: Error handling...
    qDebug() << "XML Parse Error " << webpage->errorString();
}

webpage->clear();
delete webpage;
Darkproduct

1 Answer


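As the name suggests, QXmlStreamReader is meant for parsing XML. HTML is not based on XML, so it cannot reliably be parsed with QXmlStreamReader. For example, HTML allows tags such as <meta> to stay unclosed and predefines entities such as &raquo;, while XML requires every element to be closed and only predeclares a handful of entities (&amp;, &lt;, &gt;, &quot;, &apos;). That is where your "Opening and ending tag mismatch" and "Entity 'raquo' not declared" errors come from.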

That said, if you can convert the HTML into XHTML, you will be able to parse it with QXmlStreamReader. However, Qt has no built-in method of performing this conversion. It is possible to convert arbitrary HTML to XHTML with 3rd party libraries such as tidylib.
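To see why the reader stops where it does on the page from the question, here is a minimal sketch in standard C++ (no Qt) of the nesting rule an XML parser enforces. `firstTagMismatch` is a hypothetical helper written for this illustration, not a Qt API, and it is deliberately naive: it does not handle comments, CDATA sections, or `>` characters inside attribute values.

```cpp
#include <string>
#include <vector>

// Hypothetical helper for illustration only (not a Qt API).
// Returns an empty string if every start tag is closed in the right order,
// otherwise a description of the first problem found.
std::string firstTagMismatch(const std::string& markup)
{
    std::vector<std::string> open;  // names of elements that are still open
    std::size_t pos = 0;

    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos)
            return "unterminated tag";

        // Text between '<' and '>', e.g. "/HEAD" or "meta http-equiv=...".
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        if (tag.empty() || tag[0] == '!' || tag[0] == '?')
            continue;  // DOCTYPE, comment, or processing instruction

        const bool closing = (tag[0] == '/');
        const bool selfClosing = (tag.back() == '/');  // XHTML style: <meta ... />

        // The element name ends at the first whitespace or '/'.
        const std::size_t start = closing ? 1 : 0;
        std::size_t nameEnd = tag.find_first_of(" \t\r\n/", start);
        if (nameEnd == std::string::npos)
            nameEnd = tag.size();
        const std::string name = tag.substr(start, nameEnd - start);

        if (closing) {
            if (open.empty())
                return "</" + name + "> has no matching start tag";
            if (open.back() != name)  // XML names are case-sensitive
                return "</" + name + "> closes while <" + open.back() + "> is still open";
            open.pop_back();
        } else if (!selfClosing) {
            open.push_back(name);
        }
    }

    return open.empty() ? std::string() : "<" + open.back() + "> is never closed";
}
```

For the 302 page in the question, this reports that `</HEAD>` arrives while `meta` is still open, which matches the element names the reader produced before failing. If the tag is written XHTML-style as `<meta ... />`, the same input passes the check, which is the kind of repair an HTML-to-XHTML conversion performs.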

MrEricSir
  • And is there a built-in way to parse HTML with Qt? I thought I could use `QXmlStreamReader` because of the answer in [this](http://stackoverflow.com/questions/18676800/how-to-parse-html-with-c-qt?noredirect=1&lq=1) thread. – Darkproduct Apr 25 '17 at 22:45
  • No, Qt has no built-in HTML parser. For an explanation of why they removed it in Qt WebEngine, [read this page](http://doc.qt.io/qt-5/qtwebenginewidgets-qtwebkitportingguide.html). – MrEricSir Apr 25 '17 at 23:08
  • Hm, I can't find any info about an old HTML parser in QWebEngine. I think I'll look for a 3rd party library. – Darkproduct Apr 25 '17 at 23:41