6

I want to get the source (HTML) of a webpage, for example the homepage of StackOverflow.

This is what I've coded so far:

QNetworkAccessManager manager;
QNetworkReply *response = manager.get(QNetworkRequest(QUrl(url)));

QString html = response->readAll(); // Source should be stored here

But nothing happens! When I try to get the value of the html string it's empty ("").

So, what to do? I am using Qt 5.3.1.

Alaa Salah

4 Answers

8

You need to code it in asynchronous fashion. C++11 and Qt come to the rescue. Just remember that the body of the lambda will execute later from the event loop.

// https://github.com/KubaO/stackoverflown/tree/master/questions/html-get-24965972
#include <QtNetwork>
#include <functional>

void htmlGet(const QUrl &url, const std::function<void(const QString&)> &fun) {
   QScopedPointer<QNetworkAccessManager> manager(new QNetworkAccessManager);
   QNetworkReply *response = manager->get(QNetworkRequest(QUrl(url)));
   QObject::connect(response, &QNetworkReply::finished, [response, fun]{
      response->deleteLater();
      response->manager()->deleteLater();
      if (response->error() != QNetworkReply::NoError) return;
      auto const contentType =
            response->header(QNetworkRequest::ContentTypeHeader).toString();
      static QRegularExpression re("charset=([!-~]+)");
      auto const match = re.match(contentType);
      if (!match.hasMatch() || 0 != match.captured(1).compare("utf-8", Qt::CaseInsensitive)) {
         qWarning() << "Content charsets other than utf-8 are not implemented yet:" << contentType;
         return;
      }
      auto const html = QString::fromUtf8(response->readAll());
      fun(html); // do something with the data
   }) && manager.take();
}

int main(int argc, char *argv[])
{
   QCoreApplication app(argc, argv);
   htmlGet({"http://www.google.com"}, [](const QString &body){ qDebug() << body; qApp->quit(); });
   return app.exec();
}

Unless you're only using this code once, you should keep the QNetworkAccessManager instance as a member of your controller class, or create it in main(), and reuse it for every request.

Kuba hasn't forgotten Monica
  • Just asking on your first if statement why do you use return if there is no NoError? – reggie Jul 26 '14 at 03:53
  • @reggie_jimac I return if there is an error (the error status is *other than* NoError). If there is an error, there's likely no valid data, and further processing is pointless. – Kuba hasn't forgotten Monica Jul 26 '14 at 14:24
  • Using objects after calling `deleteLater` can be considered bad style. If someone adds an operation that causes events processing in the middle, the code will become implicitly invalid. – Pavel Strakhov Jul 27 '14 at 11:45
  • @PavelStrakhov Nested event loops do not process `deleteLater` events, for the very reason you cite. That's another reason why pretend-synchronous programming is bad. – Kuba hasn't forgotten Monica Jul 27 '14 at 13:48
7

You have to run a QEventLoop between the get() call and readAll(), so that the code waits until the reply has finished:

QNetworkAccessManager manager;
QNetworkReply *response = manager.get(QNetworkRequest(QUrl(url)));
QEventLoop event;
connect(response,SIGNAL(finished()),&event,SLOT(quit()));
event.exec();
QString html = response->readAll(); // Source should be stored here
MKAROL
  • This is bad advice, since you're writing asynchronous code as if it were synchronous. It isn't. If you didn't forget to actually `exec()` the event loop, you'd be exposing the asker to the arbitrary consequences of `event.exec()` potentially reentering this method, or any other methods. Since most people don't design their code with such complications in mind, I consider it a source of undefined behavior, liable to format your hard drive or launch a nuclear strike. Explicitly asynchronous coding, with help of C++11 and Qt 5 is a more peaceful alternative. – Kuba hasn't forgotten Monica Jul 26 '14 at 00:15
  • Well, the asker tried to get the HTML with synchronous code, which is why I showed him this solution. And sometimes it is easier to do it that way. Thank you for pointing out the event.exec() mistake. – MKAROL Jul 26 '14 at 00:35
  • Yes, I agree that introducing undefined behavior into your application is easy. That doesn't mean you should do it. Qt makes asynchronous coding relatively easy thanks to signals/slots even in Qt 4. With C++11 and Qt 5 there's really zero excuse to suggesting spinning local event loops and similar craziness. – Kuba hasn't forgotten Monica Jul 26 '14 at 00:39
  • Thank you so much. People didn't like your solution & said it's a really bad one, but indeed it's the only one which worked for me! Thanks again. – Alaa Salah Jul 27 '14 at 19:17
6

QNetworkAccessManager works asynchronously. You call readAll() immediately after get(), but the request has not completed at that point, so there is no data to read yet. You need to use the QNetworkAccessManager::finished signal, as shown in the documentation, and move readAll() into the slot connected to this signal.
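
To illustrate why the early readAll() comes back empty, here is a toy model in plain standard C++. It is only a sketch and does not use Qt: ToyEventLoop, ToyReply, toyGet and runDemo are hypothetical names made up for this illustration, and the "download" is just a task queued on the toy loop. The point it tries to show is that get() only schedules work, and the data exists only once the event loop has run and the finished callback fires.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Toy stand-in for an event loop: queued tasks run only when run() is called.
class ToyEventLoop {
public:
    void post(std::function<void()> task) { tasks_.push(std::move(task)); }

    void run() {
        while (!tasks_.empty()) {
            auto task = std::move(tasks_.front());
            tasks_.pop();
            task();
        }
    }

private:
    std::queue<std::function<void()>> tasks_;
};

// Toy stand-in for a network reply: its buffer stays empty until the
// event loop delivers the body.
struct ToyReply {
    std::string buffer;
    std::function<void()> onFinished;  // plays the role of the finished signal

    // Returns whatever has arrived so far and clears the buffer.
    std::string readAll() {
        std::string out;
        out.swap(buffer);
        return out;
    }
};

// Toy stand-in for get(): returns immediately and only schedules the work.
void toyGet(ToyEventLoop &loop, ToyReply &reply, const std::string &body) {
    loop.post([&reply, body] {
        reply.buffer = body;                       // the data "arrives"
        if (reply.onFinished) reply.onFinished();  // notify the listener
    });
}

struct DemoResult {
    std::string early;  // read right after toyGet()
    std::string late;   // read inside the finished callback
};

DemoResult runDemo() {
    ToyEventLoop loop;
    ToyReply reply;
    DemoResult result;

    toyGet(loop, reply, "<html>hello</html>");
    result.early = reply.readAll();  // too early: the loop has not run yet

    reply.onFinished = [&result, &reply] { result.late = reply.readAll(); };
    loop.run();  // only now is the body delivered and the callback invoked
    return result;
}
```

In this model, result.early is empty and result.late holds the body, which mirrors the situation in the question: the read has to happen in the code that runs when the reply reports it has finished.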

Pavel Strakhov
0

A short answer including the essential part in C++17:

const auto manager = new QNetworkAccessManager(this);
connect(manager, &QNetworkAccessManager::finished,
        this, [](auto reply) {
            qDebug() << reply->readAll();
        });
manager->get(QNetworkRequest({ "https://www.google.com" }));
juzzlin