
How do I use a proxy in a Python web scraping script that scrapes data from Amazon? I need to learn how to use a proxy with the script below:

import scrapy
from urls import start_urls
import re

class BbbSpider(scrapy.Spider):

    name = 'bbb_spider'
    # AUTOTHROTTLE_ENABLED must be set via custom_settings (or settings.py);
    # a plain class attribute like `AUTOTHROTTLE_ENABLED = True` is ignored by Scrapy.
    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
    }
    # start_urls = ['http://www.bbb.org/chicago/business-reviews/auto-repair-and-service-equipment-and-supplies/c-j-auto-parts-in-chicago-il-88011126']

  
    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        
        brickset = str(response)
        NAME_SELECTOR = 'normalize-space(.//div[@id="titleSection"]/h1[@id="title"]/span[@id="productTitle"]/text())'
        #PAGELINK_SELECTOR = './/div[@class="info"]/h3[@class="n"]/a/@href'
        ASIN_SELECTOR = './/table/tbody/tr/td/div[@class="content"]/ul/li[./b[text()="ASIN: "]]//text()'
        #LOCALITY = 'normalize-space(.//div[@class="info"]/div/p/span[@class="locality"]/text())'
        #PRICE_SELECTOR = './/div[@id="price"]/table/tbody/tr/td/span[@id="priceblock_ourprice"]//text()'
        PRICE_SELECTOR = '#priceblock_ourprice::text'  # ::text extracts the text, not the element HTML
        STOCK_SELECTOR = 'normalize-space(.//div[@id="availability"]/span/text())'
        PRODUCT_DETAIL_SELECTOR = './/table//div[@class="content"]/ul/li//text()'
        PRODUCT_DESCR_SELECTOR = 'normalize-space(.//div[@id="productDescription"]/p/text())'
        IMAGE_URL_SELECTOR = './/div[@id="imgTagWrapperId"]/img/@src'

        # extract_first(default='') avoids an AttributeError when a selector
        # matches nothing; wrapping .encode('utf8') in str() would produce
        # strings like "b'...'" on Python 3, so it is dropped here.
        yield {
            'name': response.xpath(NAME_SELECTOR).extract_first(default=''),
            'pagelink': response.url,
            #'asin' : str(re.search("<li><b>ASIN: </b>([A-Z0-9]+)</li>",brickset).group(1).strip()),
            'price' : response.css(PRICE_SELECTOR).extract_first(default=''),
            'stock' : response.xpath(STOCK_SELECTOR).extract_first(default=''),
            'product_detail' : response.xpath(PRODUCT_DETAIL_SELECTOR).extract(),
            'product_description' : response.xpath(PRODUCT_DESCR_SELECTOR).extract(),
            'img_url' : response.xpath(IMAGE_URL_SELECTOR).extract_first(default=''),
        }

and the start_urls file is here:

start_urls = ['https://www.amazon.co.uk/d/Hair-Care/Loreal-Majirel-Hair-Colour-Tint-Golden-Mahogany/B0085L50QU', 'https://www.amazon.co.uk/d/Hair-Care/Michel-Mercier-Ultimate-Detangling-Wooden-Brush-Normal/B00TE1WH7U']
itsmnthn

2 Answers


As far as I know, there are two ways to use a proxy with Python code:

  • Set the environment variables http_proxy and https_proxy; this is probably the easiest way to use a proxy.

    Windows:

    set http_proxy=http://proxy.myproxy.com  
    set https_proxy=https://proxy.myproxy.com  
    python your_script.py
    

    Linux/OS X:

    export http_proxy=http://proxy.myproxy.com
    export https_proxy=https://proxy.myproxy.com
    sudo -E python your_script.py
    
  • Support for HTTP proxies has been provided since Scrapy 0.8 through the HTTP proxy downloader middleware; you can check out HttpProxyMiddleware.

    This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.

    Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:

    http_proxy
    https_proxy
    no_proxy
    

Hope this helps.

McGrady

If you want to set the proxy inside your code, do this:

def start_requests(self):
    for x in start_urls:
        req = scrapy.Request(x, self.parse)
        # meta['proxy'] must be a full proxy URL such as 'http://1.2.3.4:8080',
        # not just a bare IP address.
        req.meta['proxy'] = 'http://your_proxy_ip:port'
        yield req
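
If your proxy requires authentication, one common approach is to embed the credentials in the proxy URL, or to send a Proxy-Authorization header alongside meta['proxy']. A minimal sketch (the host and credentials below are placeholders):

```python
import base64

# Placeholder credentials and host; replace with your proxy's details.
user, password, host = "myuser", "mypass", "proxy.example.com:8080"

# Option 1: embed the credentials directly in the proxy URL.
proxy_url = f"http://{user}:{password}@{host}"

# Option 2: a plain proxy URL plus an explicit Proxy-Authorization header,
# which you would set on request.headers alongside meta['proxy'].
auth = base64.b64encode(f"{user}:{password}".encode()).decode()
proxy_auth_header = f"Basic {auth}"

print(proxy_url)
print(proxy_auth_header)
```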

And don't forget to put this in the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    # Note: the old 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware'
    # path was removed in Scrapy 1.x; use the path below instead.
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}
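
Since you set meta['proxy'] per request, you can also rotate through several proxies, which helps avoid Amazon blocking a single address. A sketch, assuming a hypothetical pool of proxy URLs:

```python
import random

# Hypothetical pool of proxies; replace with your own addresses.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

def pick_proxy():
    """Pick a random proxy from the pool for each outgoing request."""
    return random.choice(PROXY_POOL)

# In start_requests you would then do:
#     req.meta['proxy'] = pick_proxy()
print(pick_proxy())
```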
Umair Ayub