Crawler in Python, urlopen not working

Question

I am playing around trying to extract some info from a webpage and I have the following code:

import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

InitUrl = "https://mtgsingles.gr/search?q="
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 5

for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next

    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    page_soup = soup(page_html, "html.parser")


    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})


    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"

        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")


    try:
        # The href of the "next" link is relative, so prepend the site root.
        URL_Next = "https://mtgsingles.gr" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
    except IndexError:
        print("Crawling process completed! No more information to retrieve!")
        break  # no "next" link was found, so stop instead of requesting a stale URL
    else:
        print("The next URL is: " + URL_Next + "\n")
        NumOfCrawledPages += 1
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")

The code runs without errors, but the results are not what I expect. I am trying to extract some information from each page, along with the URL of the next page. Ultimately I would like the program to loop 5 times and crawl 5 pages. Instead, it crawls the initial page (InitUrl="xyz.com") all 5 times and never proceeds to the next-page URL that it extracts. I added some print statements to see where the problem lies, and I think it is in these statements:

 UClient = uReq(Url) 
 page_html = UClient.read()
 UClient.close()

For some reason urlopen does not work repeatedly in the for loop. Why does this happen? Is it wrong to use urlopen in a for statement?

score 0 · Answer 1 · answered May 25 '18 at 01:30

This site loads its data through an Ajax request, so you must send a POST request to get the data.

Tip: Select the URL correctly, for example: https://mtgsingles.gr/search?ajax=products-listing&q=
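
As a rough illustration, a POST request to that URL could be built with urllib along the lines below. This is only a sketch under assumptions: the helper names build_ajax_request and fetch_listing are made up for this example, the "page" form field is a hypothetical parameter name, and the X-Requested-With header is just the conventional Ajax marker, which the site may or may not require. Check the browser's network tab to see what the site actually sends.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the tip above.
AJAX_URL = "https://mtgsingles.gr/search?ajax=products-listing&q="


def build_ajax_request(query="", page=1):
    """Build a POST request for the Ajax listing endpoint.

    The "page" form field is a hypothetical name; inspect the real
    request in the browser to find the parameters the site expects.
    """
    form = urlencode({"page": page}).encode("utf-8")  # assumed field name
    headers = {
        # Conventional marker for Ajax calls; an assumption, may not be needed.
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0",
    }
    # Supplying data= makes urllib send a POST instead of a GET.
    return Request(AJAX_URL + quote(query), data=form, headers=headers)


def fetch_listing(query="", page=1):
    """Send the request and return the raw response body (needs network access)."""
    with urlopen(build_ajax_request(query, page)) as response:
        return response.read()
```

The bytes returned by fetch_listing can then be handed to BeautifulSoup in the same way as page_html in the question, provided the endpoint returns HTML.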

answered May 25 '18 at 01:30

Alihossein shahabi
