
I have been trying to crawl a website that has seemingly identified and blocked my IP and is now returning a 429 Too Many Requests response.

I installed scrapy-proxies from this link: https://github.com/aivarsk/scrapy-proxies and followed the given instructions. I got a list of proxies from here: http://www.gatherproxy.com/ and this is what my settings.py and proxylist.txt look like:

settings.py

BOT_NAME = 'project'
SPIDER_MODULES = ['project.spiders']
NEWSPIDER_MODULE = 'project.spiders'
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [429, 500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = "filepath\proxylist.txt"
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 2

PROXY_MODE = 0  # 0 = pick a random proxy from the list for each request
DOWNLOAD_HANDLERS = {'s3': None}

EXTENSIONS = {
   'scrapy.telnet.TelnetConsole': None
}

proxylist.txt

http://195.208.172.20:8080
http://154.119.56.179:9999
http://124.12.50.43:8088
http://61.7.168.232:52136
http://122.193.188.236:8118

Yet when I run my crawler, I get the following error:

[scrapy.proxies] DEBUG: Proxy user pass not found

I tried to search for this specific error on Google but could not find any solutions.

Help will be highly appreciated. Thanks a lot in advance.

Kunwar
  • This might actually only be information that your list does not have any passwords and usernames on each line, which is OK if they provide anonymous access. Have a look here: https://github.com/aivarsk/scrapy-proxies/blob/master/scrapy_proxies/randomproxy.py The else branch just logs the info: log.debug('Proxy user pass not found') – merlin Dec 25 '18 at 09:24
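
For reference, the check merlin points to in randomproxy.py amounts to roughly the following (a paraphrased sketch of the linked file, not a verbatim copy), which shows why the DEBUG line is harmless for anonymous proxies like the ones in proxylist.txt:

import re
import logging

log = logging.getLogger('scrapy.proxies')

# Roughly what randomproxy.py does with each line of PROXY_LIST:
for line in open('proxylist.txt'):
    parts = re.match(r'(\w+://)(\w+:\w+@)?(.+)', line.strip())
    if not parts:
        continue
    if parts.group(2):
        # the line had the form http://user:pass@host:port
        user_pass = parts.group(2)[:-1]
    else:
        # anonymous proxy, no credentials on the line -- this just logs:
        log.debug('Proxy user pass not found')
        user_pass = ''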

1 Answer


I suggest you create your own middleware to specify the IP:PORT, like this, and place this proxies.py middleware file inside your project's middleware folder:

class ProxiesMiddleware(object):
    def __init__(self, settings):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy instantiates the middleware through this hook
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Route every outgoing request through the given proxy
        request.meta['proxy'] = "http://IP:PORT"

Add the ProxiesMiddleware line to your settings.py:

DOWNLOADER_MIDDLEWARES = {
    'yourproject.middleware.proxies.ProxiesMiddleware': 400,
}
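
If you want to rotate over the proxylist.txt from the question instead of hard-coding a single IP:PORT, a minimal variation of the same middleware (a sketch, assuming PROXY_LIST in settings.py still points at that file) could be:

import random

class ProxiesMiddleware(object):
    def __init__(self, settings):
        # Load the proxy list referenced by PROXY_LIST in settings.py
        with open(settings.get('PROXY_LIST')) as f:
            self.proxies = [line.strip() for line in f if line.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Assign a random proxy to every outgoing request
        request.meta['proxy'] = random.choice(self.proxies)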
Umair Ayub
  • It gives this error: ImportError: No module named proxies – Kunwar Nov 07 '17 at 16:49
  • @Kunwar It probably depends on your folder hierarchy. You'll need to locate where exactly your `ProxiesMiddleware` file/function is. You probably put it directly in your `middleware` folder/file, in which case you should remove the `.proxies` from that item in your `DOWNLOADER_MIDDLEWARES` list. – Johiasburg Frowell Nov 08 '17 at 20:06
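
For anyone hitting the same ImportError: the dotted path in DOWNLOADER_MIDDLEWARES has to match your actual layout. The setting 'yourproject.middleware.proxies.ProxiesMiddleware' assumes roughly this structure (the middleware folder needs an __init__.py to be importable; names here are illustrative):

yourproject/
    __init__.py
    settings.py
    middleware/
        __init__.py
        proxies.py    # defines ProxiesMiddleware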