How to set a proxy in Python Scrapy

Question

I'm using Python 2.7 and Scrapy 1.3.0, and I need to set a proxy to access the web.

How do I set it?

This is my script in parse:

if theurl not in self.ProcessUrls:
   self.ProcessUrls.append(theurl)
   yield scrapy.Request(theurl, callback=self.parse)

If I need to make sure that a newly found URL has not been crawled before, how do I do it? If it has not been crawled yet, it should be crawled.
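For the repeat check described above, here is a minimal sketch, assuming the goal is only to remember which URLs were already requested. `SeenUrls` and `is_new` are hypothetical names and not part of Scrapy. A set is used because its membership test is cheaper than scanning a list. Note also that Scrapy's scheduler already drops duplicate requests by default, unless a request is created with dont_filter=True.

```python
class SeenUrls(object):
    """Hypothetical helper: remembers which URLs were already requested."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        # Return True only the first time a URL is offered, then record it.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


# Intended use inside a spider callback (sketch only, not executed here):
#     if self.seen.is_new(theurl):
#         yield scrapy.Request(theurl, callback=self.parse)
```

The helper would be created once per spider (for example in its constructor), so that the record survives across callbacks.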

I can't set an environment variable because it would affect other services and jobs. Can I set the proxy only in the Scrapy script? – ZivHus, Aug 01 '17 at 03:57

Nabin · Answer 1 · 2017-08-01T07:29:58.710

4

We can use the following:

from scrapy import Request

request = Request(url="http://example.com")
request.meta['proxy'] = "http://host:port"  # include the scheme, e.g. http://
yield request

A simple implementation is like below:

import scrapy

class MySpider(scrapy.Spider):
    name = "examplespider"
    allowed_domains = ["somewebsite.com"]
    start_urls = ['http://somewebsite.com/']

    def parse(self, response):
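        # Placeholder URL; we usually get this URL by parsing the desired webpage.
        # It should fall under allowed_domains, or the offsite filter drops the request.
        request = scrapy.Request(url='http://somewebsite.com/page', callback=self.parse_url)
        request.meta['proxy'] = "http://host:port"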
        yield request

    def parse_url(self, response):
        # Do rest of the parsing work
        pass

If you want to use the proxy from the very first request:

Add the following as a spider class field:

class MySpider(scrapy.Spider):
    name = "examplespider"
    allowed_domains = ["somewebsite.com"]
    start_urls = ['http://somewebsite.com/']
    custom_settings = {
        'HTTPPROXY_ENABLED': True
    }

And then use the start_requests() method as below:

    def start_requests(self):
        urls = ['http://example.com']  # request URLs need a scheme
        for url in urls:
            proxy = 'http://host:port'  # placeholder proxy address
            yield scrapy.Request(url=url, callback=self.parse, meta={'proxy': proxy})

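    def parse(self, response):
        # StatusCheckerItem is assumed to be an Item class with a 'url' field,
        # defined elsewhere in the project.
        item = StatusCheckerItem()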
        item['url'] = response.url
        return item
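
If every request should go through the same proxy, one option is a small downloader middleware, so the proxy does not have to be repeated in each Request. The following is only a sketch under assumptions: `SingleProxyMiddleware`, the `PROXY_URL` setting and the `myproject.middlewares` path are made-up names, and the priority 350 is chosen only so that it runs before Scrapy's built-in HttpProxyMiddleware (750 in the default settings).

```python
class SingleProxyMiddleware(object):
    """Hypothetical downloader middleware: one proxy for every request."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_URL is a made-up setting name used only in this sketch.
        return cls(crawler.settings.get('PROXY_URL'))

    def process_request(self, request, spider):
        # Leave requests alone if they already carry their own proxy.
        if self.proxy_url and 'proxy' not in request.meta:
            request.meta['proxy'] = self.proxy_url
        return None  # let Scrapy continue processing the request


# Example settings (e.g. in custom_settings of the spider); the module path
# is hypothetical and must match where the class above actually lives.
EXAMPLE_SETTINGS = {
    'PROXY_URL': 'http://host:port',
    'DOWNLOADER_MIDDLEWARES': {
        'myproject.middlewares.SingleProxyMiddleware': 350,
    },
}
```

With this in place the spider can keep using start_urls as usual, and only this crawl is affected, not other services on the machine.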

edited Aug 01 '17 at 07:29

answered Aug 01 '17 at 05:24

Nabin


Can I set the proxy from the initial request? – ZivHus Aug 01 '17 at 06:05
Where can I add custom_settings? – ZivHus Aug 01 '17 at 06:19
Where can I get the HTML response? In parse? Is the parse function inside start_requests? – ZivHus Aug 01 '17 at 07:27
Can I just use start_urls? Instead of urls = ['example.com'], can I write urls = self.start_urls? – ZivHus Aug 01 '17 at 07:29
Yes, you can do that. Also, the parse function is not inside the start_requests function; see the edit. You can find the HTML inside the parse function. – Nabin Aug 01 '17 at 07:31
This is my script: if theurl not in self.ProcessUrls: self.ProcessUrls.append(theurl) yield scrapy.Request(theurl, callback=self.parse). If I need to make sure a newly found URL has not been crawled before, how do I do it? – ZivHus Aug 01 '17 at 07:42
I cannot understand this in a comment. Please update your question. – Nabin Aug 01 '17 at 07:43
I've updated my question. – ZivHus Aug 01 '17 at 07:49
Do you have any ideas? – ZivHus Aug 02 '17 at 03:02

score 0 · Answer 2 · answered Aug 01 '17 at 03:19

0

You have to set the http_proxy and https_proxy environment variables. Refer to this: proxy for scrapy
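
If changing the system-wide environment is not an option, the variables can also be set from inside the script, since changes to os.environ only affect the current process and any child processes it starts. A minimal sketch follows; `use_proxy_for_this_process` is a hypothetical helper name and the address is a placeholder. As far as I know, Scrapy's HttpProxyMiddleware reads these variables when it is created, so this would have to run before the crawl starts.

```python
import os


def use_proxy_for_this_process(proxy_url):
    """Set the proxy variables for the current Python process only."""
    os.environ['http_proxy'] = proxy_url
    os.environ['https_proxy'] = proxy_url


# Intended use at the top of the script that starts the crawl (sketch):
#     use_proxy_for_this_process('http://host:port')
```

Other services and jobs on the same machine keep their own environment and are not touched by this.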

answered Aug 01 '17 at 03:19

rsu8


I can't set an environment variable because it would affect other services and jobs. Can I set the proxy only in the Scrapy script? – ZivHus Aug 01 '17 at 03:58
Have you tried [scrapy_proxies](https://github.com/aivarsk/scrapy-proxies)? – rsu8 Aug 01 '17 at 04:12
Yes, I've tried that one, but I don't know what the DOWNLOADER_MIDDLEWARES value is for. And if I only have one proxy to set, does it still need to read the text file? – ZivHus Aug 01 '17 at 05:17
