
I am using proxy rotation in my project to avoid being banned from a website. I have to scrape a list of URLs, http://website/0001 to http://website/9999, and when the site detects that I am scraping, it redirects me to website/contact.html.

I already have my proxy list in the settings:

    ROTATING_PROXY_LIST = [
        'proxy1.com:8000',
        'proxy2.com:8031',
        # ...
    ]

And I created this Spider:

    next_page_url = response.url[17:]  # getting the relative url from website/page

    if next_page_url == "contact.html":

        absolute_next_page = response.urljoin(last_page)
        yield Request(absolute_next_page)
        # should try the same page with different proxy
    else:
        next_page_url = int(next_page_url)+1
        last_page = str(next_page_url).zfill(4)
        absolute_next_page = response.urljoin(last_page)
        yield Request(absolute_next_page)

But it raises an error: `UnboundLocalError: local variable 'last_page' referenced before assignment`

How can I specify that the proxy is dead in this spider? Or is there another way to do the same thing?

ucMedia
Gian Carlo

1 Answer


What are you trying to ask?

You say you got this error:

UnboundLocalError: local variable 'last_page' referenced before assignment

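This error means that you are trying to read a local variable before it has been assigned a value. In your code, `last_page` is only assigned in the `else` branch, so the `contact.html` branch reads it before any assignment has happened.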
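To make the cause concrete, here is a minimal sketch that reproduces the error outside of Scrapy. The function name `next_page` is a hypothetical helper invented for illustration; only the branch structure is taken from the question.

```python
def next_page(relative_url):
    """Hypothetical helper that mimics the branch structure from the question."""
    if relative_url == "contact.html":
        # last_page is only assigned in the else branch below, so Python
        # treats it as a local name that has no value yet on this path.
        return last_page
    else:
        last_page = str(int(relative_url) + 1).zfill(4)
        return last_page


print(next_page("0001"))  # the else branch assigns before reading

try:
    next_page("contact.html")
except UnboundLocalError as exc:
    print("UnboundLocalError:", exc)
```

Because `last_page` is assigned somewhere in the function, Python treats it as local everywhere in that function, which is why the first branch fails instead of falling back to a variable defined elsewhere.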

So to prevent this error, change your code like this:

    import random  # needed for random.choice below

    from scrapy import Request

    # Body of the spider's parse() method. Assumes last_page is a class
    # attribute on the spider, so it is accessed as self.last_page.
    next_page_url = response.url[17:]  # relative URL, e.g. "0001" or "contact.html"

    if next_page_url == "contact.html":
        # Blocked: retry the last requested page. dont_filter=True stops
        # Scrapy's duplicate filter from dropping the repeated URL.
        req = Request(url=response.urljoin(self.last_page), dont_filter=True)
        # Choose a random proxy from the list defined in the settings.
        req.meta['proxy'] = random.choice(self.settings.getlist('ROTATING_PROXY_LIST'))
        yield req
    else:
        # Normal page: move on to the next page number.
        self.last_page = str(int(next_page_url) + 1).zfill(4)
        yield Request(response.urljoin(self.last_page))
Umair Ayub
  • Sorry, I forgot to mention: I initialized last_page as a global variable after the start_urls variable, because if the spider lands on contact.html it has to return to the same link I was trying to access in the last request. I still don't know how to do that. – Gian Carlo Jun 21 '17 at 12:34
  • Hmm, that's a class variable, not a local variable. If you want to access a class variable from within a method, use `self.last_page`. – Umair Ayub Jun 21 '17 at 12:41
  • Thanks, that worked! But now it says NameError: name 'random' is not defined. I thought random was already defined; should I create it? – Gian Carlo Jun 21 '17 at 12:50
  • Hehehe, you look new to Python. Add `import random` at the top of the file. – Umair Ayub Jun 21 '17 at 12:50
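
Putting the points from this thread together (state kept on `self.last_page`, `import random` at the top, a random proxy on retry), the logic can be sketched without Scrapy as follows. This is only an illustration under stated assumptions: `PageTracker`, `next_request`, and `BASE_URL` are hypothetical names, and the proxy list is a shortened stand-in for the one in the settings.

```python
import random

# Hypothetical values standing in for the project's settings and site.
ROTATING_PROXY_LIST = ["proxy1.com:8000", "proxy2.com:8031"]
BASE_URL = "http://website/"


class PageTracker:
    """Hypothetical stand-in for the spider, without Scrapy."""

    last_page = "0001"  # class attribute, read and written via self

    def next_request(self, response_url):
        """Return (url, proxy) for the request that should follow response_url."""
        relative = response_url[len(BASE_URL):]
        if relative == "contact.html":
            # Blocked: retry the remembered page through a randomly chosen proxy.
            return BASE_URL + self.last_page, random.choice(ROTATING_PROXY_LIST)
        # Normal page: remember and request the next page number.
        self.last_page = str(int(relative) + 1).zfill(4)
        return BASE_URL + self.last_page, None


tracker = PageTracker()
print(tracker.next_request("http://website/0001"))
print(tracker.next_request("http://website/contact.html"))
```

In a real spider the returned URL and proxy would go into a `Request` and its `meta['proxy']` entry; the sketch leaves that out so it can run on its own.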