I am trying to scrape content from a dynamic website that requires login. I found this piece of code, which works for PyQt4, in the question "Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?":

#!/usr/bin/python
# -*- coding: latin-1 -*-
import sys
import base64
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtNetwork

class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)

    username = 'username'
    password = 'password'

    base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
    authheader = "Basic %s" % base64string

    headerKey = QByteArray("Authorization")
    headerValue = QByteArray(authheader)

    url = QUrl(url)
    req = QtNetwork.QNetworkRequest()
    req.setRawHeader(headerKey, headerValue)
    req.setUrl(url)

    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)


    self.mainFrame().load(req)
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

def main():
    url = 'http://www.google.com'
    r = Render(url)
    html = r.frame.toHtml()

How can I translate the same code to work with PyQt5?

eyllanesc
cnuvadga

1 Answer

You have to use QWebEnginePage, whose tasks are asynchronous (for example, obtaining the HTML). Also, QtWebEngine does not use QNetworkRequest, so you must use QWebEngineHttpRequest:

import sys

from PyQt5.QtCore import QByteArray, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineHttpRequest
from PyQt5.QtWebEngineWidgets import QWebEnginePage


class Render(QWebEnginePage):
    def __init__(self, url):
        app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.loadFinished.connect(self._loadFinished)

        self._html = ""

        username = "username"
        password = "password"
        base64string = QByteArray(("%s:%s" % (username, password)).encode()).toBase64()
        request = QWebEngineHttpRequest(QUrl.fromUserInput(url))
        # The Basic scheme is "Basic <base64 credentials>", with no colon after "Basic".
        request.setHeader(b"Authorization", b"Basic " + bytes(base64string))

        self.load(request)

        app.exec_()

    @property
    def html(self):
        return self._html

    def _loadFinished(self):
        self.toHtml(self.handle_to_html)

    def handle_to_html(self, html):
        self._html = html
        QApplication.quit()


def main():
    url = "http://www.google.com"
    r = Render(url)
    print(r.html)


if __name__ == "__main__":
    main()
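
The value sent in the Authorization header can also be checked without Qt. The following is only a minimal sketch using the standard library; the helper name `basic_auth_header` is hypothetical and not part of PyQt5. It assumes plain HTTP Basic authentication, where the header value is the word "Basic", a single space, and the Base64 encoding of "username:password". `base64.b64encode` is used here in place of the Python 2 `base64.encodestring` call from the PyQt4 code.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic auth as bytes.

    Hypothetical helper for illustration only; it is not a PyQt5 API.
    """
    # Join the credentials with a colon and encode them to bytes.
    credentials = ("%s:%s" % (username, password)).encode()
    # b64encode returns bytes without a trailing newline.
    return b"Basic " + base64.b64encode(credentials)


if __name__ == "__main__":
    print(basic_auth_header("username", "password"))
```

The resulting bytes could be passed as the second argument of `request.setHeader(b"Authorization", ...)` in the class above.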
eyllanesc
  • Thanks @eyllanesc, the current solution spits out chunks of JavaScript code embedded in the HTML. How do I get the page content that is being loaded by JavaScript? – cnuvadga Sep 24 '20 at 23:46
  • @cnuvadga I don't understand you, my code only solves the current question of your post, nothing more. If you have other problems then you must create a new post and provide a [mre]. At SO we do not help projects but we solve specific questions. – eyllanesc Sep 24 '20 at 23:48
  • After loading the page content, I tried to access an HTML div element with the class name "grp_0", but it returns None – cnuvadga Sep 25 '20 at 00:02
  • @cnuvadga 1) The question is how to access a page that requires headers to authenticate, which is literally translating the code from pyqt4 (QtWebkit) to PyQt5 (QtWebEngine) so my answer does that. 2) Although it is not my duty, I saw that page "http://www.google.com" does not have any div grp_0. As I already pointed out: If you have another problem (accessing a certain div) then you must create a new post with an MRE. – eyllanesc Sep 25 '20 at 00:08
  • @cnuvadga 3) I will not discuss any more about these issues but I will only respond to those that directly involve my answer, so I ask you to read [ask] and pass the [tour] so that you know (or reread) the SO rules. – eyllanesc Sep 25 '20 at 00:08
  • Thanks for your time and effort @eyllanesc, don't be offended by my naivety, I am new to this library – cnuvadga Sep 25 '20 at 00:11
  • @cnuvadga It does not matter that you are new to Qt / PyQt, since the dynamics of SO are generic: we solve specific problems and nothing else. We do not pretend to solve all problems, only the specific question; if the problem has no limits then it would never be solved. In your previous post you pointed out a specific problem and my solution covered that problem, and I did the same with your current post. – eyllanesc Sep 25 '20 at 00:15
  • I'll create another post as you recommended – cnuvadga Sep 25 '20 at 00:17