I got the proxy list with proxybroker.

sudo pip install proxybroker
proxybroker grab --countries US --limit 100 --outfile proxies.txt
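
Note that grab collects proxies without checking whether they actually work. proxybroker also has a find command that checks proxies before writing them out; something like the following should do it (exact flags per proxybroker --help):

proxybroker find --types HTTP --countries US --limit 100 --outfile proxies.txt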

To convert lines from the format <Proxy US 0.00s [] 104.131.6.78:80> into 104.131.6.78:80, I used grep:

grep -oP '([0-9]+\.){3}[0-9]+:[0-9]+' proxies.txt > proxy.csv
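
The same extraction can also be done in Python with the re module; a small sketch that writes one host:port per line (file names as above):

import re

# pull host:port pairs such as 104.131.6.78:80 out of proxybroker's output
pattern = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d+")

with open("proxies.txt") as src, open("proxy.csv", "w") as dst:
    for line in src:
        match = pattern.search(line)
        if match:
            dst.write(match.group(0) + "\n")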

All the proxies in proxy.csv are in the following format:

cat proxy.csv
104.131.6.78:80
104.197.16.8:3128
104.131.94.221:8080
63.110.242.67:3128
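
Note that these entries have no scheme; for reference, a minimal sketch of loading the file into a pool with an http:// prefix added (the prefixed form matches the proxy list format shown in the answer below):

import csv

# read host:port entries and prefix the scheme used in meta['proxy']
with open('proxy.csv') as csvfile:
    proxy_pool = ['http://' + row[0] for row in csv.reader(csvfile) if row]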

I wrote my crawler following this page:
Multiple Proxies

Here is my skeleton structure, test.py.

import csv
import random

import scrapy
from scrapy import Request

class TestSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["xxxx.com"]

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        self.timeout = 10
        # load the proxy pool built from proxy.csv above
        with open('proxy.csv') as csvfile:
            self.proxy_pool = [row[0] for row in csv.reader(csvfile)]

    def start_requests(self):
        # url comes from my URL set (omitted here)
        yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url, callback=self.parse)
        if self.proxy_pool:
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req

    def parse(self, response):
        # do something
        pass

The following error occurs when I run the spider with scrapy runspider test.py:

Connection was refused by other side: 111: Connection refused.

With the same proxies obtained from proxybroker, I downloaded the URL set in my own way instead of with Scrapy.
To keep it simple, all broken proxy IPs remain in the pool instead of being removed.
The following code snippet only tests whether the proxy IPs can be used; it does not download the URL set perfectly.
The program structure is as follows.

import csv
import time
import urllib.request

data_dir = "/tmp/"

urls = []  # omitted: how the URL set is built

with open(data_dir + 'proxy.csv') as csvfile:
    ippool = [row[0] for row in csv.reader(csvfile)]
ip_len = len(ippool)
ipth = 0

for ith, url in enumerate(urls):
    time.sleep(2)
    flag = 1
    if ipth >= ip_len:
        ipth = 0
    while ipth < ip_len and flag == 1:
        try:
            # route this request through the current proxy
            handler = urllib.request.ProxyHandler({'http': ippool[ipth]})
            opener = urllib.request.build_opener(handler)
            urllib.request.install_opener(opener)
            response = urllib.request.urlopen(url).read().decode("utf8")
            with open(data_dir + str(ith), "w") as fh:
                fh.write(response)
            ipth = ipth + 1
            flag = 0
            print(url + " downloaded")
        except Exception:
            # move on to the next proxy instead of retrying the same one forever
            ipth = ipth + 1
            print("can not download " + url)

Many URLs can be downloaded with the proxies grabbed by proxybroker.
It is clear that:

  1. many of the proxy IPs grabbed by proxybroker can be used; many of them are free and stable;
  2. there is some bug in my Scrapy code.

How do I fix the bugs in my Scrapy code?


1 Answer

Try using scrapy-proxies.

In your settings.py you can add something like this:

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
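
Note that scrapy-proxies expects each PROXY_LIST entry to include the scheme (as in the commented examples above), while the proxy.csv built in the question contains bare host:port pairs. A small sketch for converting the file (paths are placeholders):

# prepend the scheme expected in the proxy list
with open('proxy.csv') as src, open('/path/to/proxy/list.txt', 'w') as dst:
    for line in src:
        host_port = line.strip()
        if host_port:
            dst.write('http://' + host_port + '\n')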

Hopefully this will help you, as this solved my problem too.
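
Alternatively, without scrapy-proxies, one thing worth checking in the spider from the question is that meta['proxy'] is set to a bare host:port, while the Scrapy documentation shows full URLs such as http://host:port. A hedged sketch of that change in get_request (same names as in the question); dead free proxies can of course also cause "Connection refused", which is why the retry settings above help:

    def get_request(self, url):
        req = Request(url=url, callback=self.parse)
        if self.proxy_pool:
            proxy = random.choice(self.proxy_pool)
            # e.g. 'http://104.131.6.78:80' rather than '104.131.6.78:80'
            if not proxy.startswith('http'):
                proxy = 'http://' + proxy
            req.meta['proxy'] = proxy
        return req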
