
I'm using PyQt4 (for the first time) to scrape some pages. Since I scrape multiple pages, I use QEventLoop. However, I could not get the loadFinished signal to work in my code. Here is what my code looks like:

# Imports
import requests
from bs4 import BeautifulSoup
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4.QtNetwork import QNetworkRequest
import csv
import win_unicode_console
import time
# Main setting
DIR = "data"
URL = "https://addons.mozilla.org"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}

def Render(url):
    page = QWebPage()
    loop = QEventLoop() # Create event loop
    page.mainFrame().loadFinished.connect(loop.quit) # Connect loadFinished to loop quit
    page.mainFrame().load(QUrl(url))
    loop.exec_() # Run event loop, it will end on loadFinished
    return page.mainFrame().toHtml()

app = QApplication(sys.argv)

def pagination(page):
    page_url = "https://addons.mozilla.org/en-US/firefox/extensions/?sort=users&page=" + str(page)
    response = requests.get(page_url, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    items = soup.findAll("div", class_="item addon")
    for item in items:
        time.sleep(2)
        item = URL + item.h3.select('a')[0].get('href')
        print(item)
        addon_scraper(item)

def addon_scraper(url):
    time.sleep(7)
    result = Render(url)
    print(result)
    soup = BeautifulSoup(result, "lxml")
    addon_name = soup.select("#addon > hgroup > h1 > span")[0].get_text()
    print(addon_name)
    addon_author = soup.select("#addon > hgroup > h4 > a")[0].get_text()
    category = soup.select("#related > ul")[0].get_text().strip()
    with open("category_list.csv", "a", newline="", encoding="utf-16") as f:
        writer = csv.writer(f, dialect="excel-tab")
        writer.writerow([addon_name, addon_author, category])


# Run the scraper
if __name__ == "__main__":
    win_unicode_console.enable() # Enable unicode support in command line interface
    for i in range(1, 100):
        print(i)
        pagination(i)
        app.exit()

At the end it just restarts the script and does nothing. I was trying to implement the solution provided by user Mip here: Web Scraping Multiple Links with PyQt / QtWebkit. I think adding a user agent to the app above and an implicit wait (similar to the Selenium case) would solve my problem, but I couldn't manage to do it. Now I get the following error; I think it is because PyQt4 quits the event loop before the page content is loaded:

Traceback (most recent call last):
  File "main.py", line 56, in <module>
    pagination(i)
  File "mozilla_file.py", line 36, in pagination
    addon_scraper(item)
  File "mozilla_file.py", line 46, in addon_scraper
    category = soup.select("#related > ul")[0].get_text().strip()
IndexError: list index out of range
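
This is roughly what I had in mind for the user agent and a wait/timeout. It is only a sketch (I am not sure overriding userAgentForUrl and adding a QTimer timeout is the right approach), but the idea is to subclass QWebPage so WebKit sends the same user agent as the requests calls, and to add a timeout so the event loop cannot hang if loadFinished never fires:

class UserAgentPage(QWebPage):
    # Make WebKit use the same user agent as the requests calls above
    def userAgentForUrl(self, url):
        return headers["User-Agent"]

def Render(url, timeout_ms=15000):
    # QWebPage, QEventLoop, QTimer and QUrl all come from the star imports above
    page = UserAgentPage()
    loop = QEventLoop()
    page.loadFinished.connect(loop.quit)      # quit the loop when the page reports it is done
    QTimer.singleShot(timeout_ms, loop.quit)  # safety net so the loop cannot block forever
    page.mainFrame().load(QUrl(url))
    loop.exec_()
    return page.mainFrame().toHtml()

Even with something like this, I suppose soup.select("#related > ul") could still return an empty list if the page did not render that section, so I would also have to check the list before indexing it.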
  • What other answers? What do you mean "shell is restarted"? What errors? Please read the guidance on [ask] and how to provide a [mcve]. – ekhumoro Oct 27 '17 at 15:47
  • With the current script, when I run it, it basically does nothing and says "RESTART Shell". – edyvedy13 Oct 27 '17 at 16:22
  • No, it does not do that. It is clearly not possible to run the code currently in your question, as there are several parts of it missing. Please read the second link I gave in my previous comment if you want help. – ekhumoro Oct 27 '17 at 16:25
  • I shared the entire script – edyvedy13 Oct 27 '17 at 16:33
  • So what exactly is the problem? When I run the script, it prints a url, some html, and a name. After a while, it stops with an `IndexError` whilst parsing the html - but that has nothing to do with signals or event-loops. – ekhumoro Oct 27 '17 at 16:55
  • I added a snapshot, this is what I get – edyvedy13 Oct 27 '17 at 17:26
  • After a while, you get an IndexError because sometimes rendering takes longer than the 7-second sleep. That is another problem; I should probably add something like an implicit wait and also a user agent. – edyvedy13 Oct 27 '17 at 17:32
  • Run the example in a command window, rather than IDLE. – ekhumoro Oct 27 '17 at 17:36
  • In this case I get a "python has stopped working" error. – edyvedy13 Oct 27 '17 at 17:48
  • I executed your code with the line `win_unicode_console.enable()` commented out, since it does not work on Linux, and I obtained the following result: https://pastebin.com/1zptf1yy. The execution ended with an error: `Traceback (most recent call last): File "main.py", line 56, in pagination(i) File "main.py", line 36, in pagination addon_scraper(item) File "main.py", line 46, in addon_scraper category = soup.select("#related > ul")[0].get_text().strip() IndexError: list index out of range` – eyllanesc Oct 27 '17 at 22:51
  • I think it is because the script does not wait for the page to be loaded, and I don't know how to achieve that. – edyvedy13 Oct 27 '17 at 22:54
  • @edyvedy13 Do you get the same? – eyllanesc Oct 27 '17 at 22:58
  • At work I was getting nothing, but when I tried it at home on my own PC, I also started to get the IndexError. – edyvedy13 Oct 27 '17 at 22:59
  • @edyvedy13 Maybe at your work the firewall is causing nothing to be returned; I recommend using Wireshark to review what you are sending and receiving. – eyllanesc Oct 27 '17 at 23:02
  • Oh OK, let me take a look. I will also update the question with the IndexError part. Thank you very much for your answer. – edyvedy13 Oct 27 '17 at 23:06
  • The error comes from this line: `for i in range(1, 100)`. Why do you need to iterate 100 times? – eyllanesc Oct 27 '17 at 23:11
  • To scrape 100 pages, I guess, but the element basically exists on each and every page. – edyvedy13 Oct 27 '17 at 23:14

0 Answers