I got the proxy list with proxybroker:
sudo pip install proxybroker
proxybroker grab --countries US --limit 100 --outfile proxies.txt
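(The same list can also be fetched from Python; the sketch below follows the pattern in proxybroker's README, so treat the exact Broker arguments as an assumption to check against the project docs.)

import asyncio
from proxybroker import Broker

async def save(proxies, path):
    # Drain the queue the Broker fills and write host:port pairs to a file.
    with open(path, 'w') as f:
        while True:
            proxy = await proxies.get()
            if proxy is None:  # the Broker signals completion with None
                break
            f.write('{}:{}\n'.format(proxy.host, proxy.port))

proxies = asyncio.Queue()
broker = Broker(proxies)
tasks = asyncio.gather(
    broker.grab(countries=['US'], limit=100),  # same options as the CLI call above
    save(proxies, 'proxies.txt'),
)
asyncio.get_event_loop().run_until_complete(tasks)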
To convert each entry from the format <Proxy US 0.00s [] 104.131.6.78:80> into 104.131.6.78:80, I used grep:
grep -oP '([0-9]+\.){3}[0-9]+:[0-9]+' proxies.txt > proxy.csv
All the proxies in proxy.csv are now in the following format:
cat proxy.csv
104.131.6.78:80
104.197.16.8:3128
104.131.94.221:8080
63.110.242.67:3128
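(If grep is not available, the same extraction can be done with a few lines of Python; a small sketch, assuming proxies.txt is the file produced above:)

import re

# Pull every ip:port pair out of proxybroker's human-readable output.
with open('proxies.txt') as f:
    pairs = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}:\d+', f.read())

with open('proxy.csv', 'w') as f:
    f.write('\n'.join(pairs) + '\n')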
I wrote my crawler following the webpage "Multiple Proxies". Here is its skeleton, test.py:
import csv
import random

import scrapy


class TestSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["xxxx.com"]
    start_urls = ["http://xxxx.com"]

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        self.timeout = 10
        # Load the proxy pool built above.
        with open('proxy.csv') as csvfile:
            reader = csv.reader(csvfile)
            self.proxy_pool = [row[0] for row in reader]

    def start_requests(self):
        for url in self.start_urls:
            yield self.get_request(url)

    def get_request(self, url):
        req = scrapy.Request(url=url, callback=self.parse)
        if self.proxy_pool:
            # Pick a random proxy from the pool for this request.
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req

    def parse(self, response):
        # do something
        pass
This error occurs when I run the spider with scrapy runspider test.py:

Connection was refused by other side: 111: Connection refused.
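(For reference, Scrapy's more idiomatic place for this logic would be a downloader middleware; a minimal sketch of what I believe that looks like, where the class name and settings path are my own choices rather than anything from the tutorial:)

import csv
import random


class RandomProxyMiddleware:
    """Attach a random proxy from proxy.csv to every outgoing request."""

    def __init__(self):
        with open('proxy.csv') as f:
            self.proxy_pool = [row[0] for row in csv.reader(f)]

    def process_request(self, request, spider):
        if self.proxy_pool:
            # Scrapy expects meta['proxy'] to be a full URL, scheme included.
            request.meta['proxy'] = 'http://' + random.choice(self.proxy_pool)

It would be enabled in settings.py with:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 543,
}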
With the same proxies from proxybroker, I then downloaded the URL set in my own way instead of with Scrapy. To keep it simple, broken proxy IPs stay in the pool instead of being removed; the snippet below only tests whether the proxy IPs are usable, not whether every URL downloads perfectly. The program structure is as follows:
import csv
import time
import urllib.request

data_dir = "/tmp/"
urls = []  # the list of urls to download; how it is built is omitted

with open(data_dir + 'proxy.csv') as csvfile:
    reader = csv.reader(csvfile)
    ippool = [row[0] for row in reader]
ip_len = len(ippool)

ipth = 0
for ith, url in enumerate(urls):
    time.sleep(2)
    flag = 1
    if ipth >= ip_len:
        ipth = 0  # wrap around to the start of the pool
    while ipth < ip_len and flag == 1:
        try:
            # Route this request through the current proxy.
            handler = urllib.request.ProxyHandler({'http': ippool[ipth]})
            opener = urllib.request.build_opener(handler)
            urllib.request.install_opener(opener)
            response = urllib.request.urlopen(url).read().decode("utf8")
            with open(data_dir + str(ith), "w") as fh:
                fh.write(response)
            ipth = ipth + 1
            flag = 0
            print(url + " downloaded")
        except Exception:
            # Try the next proxy instead of retrying the same dead one forever.
            ipth = ipth + 1
            print("can not download " + url)
Many URLs can be downloaded this way with the proxies grabbed by proxybroker. So it is clear that:

- many of the proxy IPs grabbed by proxybroker work, and many of them are free and stable;
- the bug is somewhere in my Scrapy code.

How can I fix the bugs in my Scrapy spider?