
How do I use a proxy in a Python web scraping script that scrapes data from Amazon? I need to learn how to use a proxy with the script below:

import scrapy
from urls import start_urls
import re

class BbbSpider(scrapy.Spider):

    name = 'bbb_spider'
    # AUTOTHROTTLE_ENABLED must be set via custom_settings (or settings.py);
    # a plain class attribute like `AUTOTHROTTLE_ENABLED = True` is ignored by Scrapy.
    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
    }
    # start_urls = ['http://www.bbb.org/chicago/business-reviews/auto-repair-and-service-equipment-and-supplies/c-j-auto-parts-in-chicago-il-88011126']

  
    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        
        brickset = str(response)
        NAME_SELECTOR = 'normalize-space(.//div[@id="titleSection"]/h1[@id="title"]/span[@id="productTitle"]/text())'
        #PAGELINK_SELECTOR = './/div[@class="info"]/h3[@class="n"]/a/@href'
        ASIN_SELECTOR = './/table/tbody/tr/td/div[@class="content"]/ul/li[./b[text()="ASIN: "]]//text()'
        #LOCALITY = 'normalize-space(.//div[@class="info"]/div/p/span[@class="locality"]/text())'
        #PRICE_SELECTOR = './/div[@id="price"]/table/tbody/tr/td/span[@id="priceblock_ourprice"]//text()'
        PRICE_SELECTOR = '#priceblock_ourprice::text'  # ::text extracts the text, not the element HTML
        STOCK_SELECTOR = 'normalize-space(.//div[@id="availability"]/span/text())'
        PRODUCT_DETAIL_SELECTOR = './/table//div[@class="content"]/ul/li//text()'
        PRODUCT_DESCR_SELECTOR = 'normalize-space(.//div[@id="productDescription"]/p/text())'
        IMAGE_URL_SELECTOR = './/div[@id="imgTagWrapperId"]/img/@src'

        # extract_first(default='') avoids an AttributeError when a selector
        # matches nothing; wrapping .encode('utf8') in str() would produce
        # strings like "b'...'" on Python 3, so it is dropped here.
        yield {
            'name': response.xpath(NAME_SELECTOR).extract_first(default=''),
            'pagelink': response.url,
            #'asin' : str(re.search("<li><b>ASIN: </b>([A-Z0-9]+)</li>",brickset).group(1).strip()),
            'price' : response.css(PRICE_SELECTOR).extract_first(default=''),
            'stock' : response.xpath(STOCK_SELECTOR).extract_first(default=''),
            'product_detail' : response.xpath(PRODUCT_DETAIL_SELECTOR).extract(),
            'product_description' : response.xpath(PRODUCT_DESCR_SELECTOR).extract(),
            'img_url' : response.xpath(IMAGE_URL_SELECTOR).extract_first(default=''),
        }

and the start_urls file is here:

start_urls = ['https://www.amazon.co.uk/d/Hair-Care/Loreal-Majirel-Hair-Colour-Tint-Golden-Mahogany/B0085L50QU', 'https://www.amazon.co.uk/d/Hair-Care/Michel-Mercier-Ultimate-Detangling-Wooden-Brush-Normal/B00TE1WH7U']
itsmnthn

2 Answers


As far as I know, there are two ways to use a proxy with Python code:

  • Set the environment variables http_proxy and https_proxy; this is probably the easiest way to use a proxy.

    Windows:

    set http_proxy=http://proxy.myproxy.com  
    set https_proxy=https://proxy.myproxy.com  
    python your_script.py
    

    Linux/OS X:

    export http_proxy=http://proxy.myproxy.com
    export https_proxy=https://proxy.myproxy.com
    sudo -E python your_script.py
    
  • Support for HTTP proxies has been provided since Scrapy 0.8 through the HTTP proxy downloader middleware; you can check out HttpProxyMiddleware.

    This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.

    Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:

    http_proxy
    https_proxy
    no_proxy
    

Hope this helps.

McGrady

If you want to set the proxy inside your code, do this:

def start_requests(self):
    for x in start_urls:
        req = scrapy.Request(x, self.parse)
        # meta['proxy'] must be a full proxy URL such as 'http://1.2.3.4:8080',
        # not just a bare IP address.
        req.meta['proxy'] = 'http://your_proxy_ip:port'
        yield req
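
If your proxy requires authentication, one common approach is to embed the credentials in the proxy URL, or to send a Proxy-Authorization header alongside meta['proxy']. A minimal sketch (the host and credentials below are placeholders):

```python
import base64

# Placeholder credentials and host; replace with your proxy's details.
user, password, host = "myuser", "mypass", "proxy.example.com:8080"

# Option 1: embed the credentials directly in the proxy URL.
proxy_url = f"http://{user}:{password}@{host}"

# Option 2: a plain proxy URL plus an explicit Proxy-Authorization header,
# which you would set on request.headers alongside meta['proxy'].
auth = base64.b64encode(f"{user}:{password}".encode()).decode()
proxy_auth_header = f"Basic {auth}"

print(proxy_url)
print(proxy_auth_header)
```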

And don't forget to put this in the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    # Note: the old 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware'
    # path was removed in Scrapy 1.x; use the path below instead.
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}
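
Since you set meta['proxy'] per request, you can also rotate through several proxies, which helps avoid Amazon blocking a single address. A sketch, assuming a hypothetical pool of proxy URLs:

```python
import random

# Hypothetical pool of proxies; replace with your own addresses.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

def pick_proxy():
    """Pick a random proxy from the pool for each outgoing request."""
    return random.choice(PROXY_POOL)

# In start_requests you would then do:
#     req.meta['proxy'] = pick_proxy()
print(pick_proxy())
```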
Umair Ayub