I am writing a scrapy crawler to grab info on shirts from Amazon. The crawler starts on an amazon page for some search, "funny shirts" for example, and collects all the result item containers. It then parses through each result item collecting data on the shirts.
I use ScraperAPI and Scrapy-user-agents to dodge amazon. The code for my spider is:
class AmazonSpiderSpider(scrapy.Spider):
name = 'amazon_spider'
page_number = 2
keyword_file = open("keywords.txt", "r+")
all_key_words = keyword_file.readlines()
keyword_file.close()
all_links = []
keyword_list = []
for keyword in all_key_words:
keyword_list.append(keyword)
formatted_keyword = keyword.replace('\n', '')
formatted_keyword = formatted_keyword.strip()
formatted_keyword = formatted_keyword.replace(' ', '+')
all_links.append("http://api.scraperapi.com/?api_key=mykeyd&url=https://www.amazon.com/s?k=" + formatted_keyword + "&ref=nb_sb_noss_2")
start_urls = all_links
def parse(self, response):
print("========== starting parse ===========")
all_containers = response.css(".s-result-item")
for shirts in all_containers:
next_page = shirts.css('.a-link-normal::attr(href)').extract_first()
if next_page is not None:
if "https://www.amazon.com" not in next_page:
next_page = "https://www.amazon.com" + next_page
yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)
second_page = response.css('li.a-last a::attr(href)').get()
if second_page is not None and AmazonSpiderSpider.page_number < 3:
AmazonSpiderSpider.page_number += 1
yield response.follow(second_page, callback=self.parse)
def parse_dir_contents(self, response):
items = ScrapeAmazonItem()
print("============= parsing page ==============")
temp = response.css('#productTitle::text').extract()
product_name = ''.join(temp)
product_name = product_name.replace('\n', '')
product_name = product_name.strip()
temp = response.css('#priceblock_ourprice::text').extract()
product_price = ''.join(temp)
product_price = product_price.replace('\n', '')
product_price = product_price.strip()
temp = response.css('#SalesRank::text').extract()
product_score = ''.join(temp)
product_score = product_score.strip()
product_score = re.sub(r'\D', '', product_score)
product_ASIN = re.search(r'(?<=/)B[A-Z0-9]{9}', response.url)
product_ASIN = product_ASIN.group(0)
items['product_ASIN'] = product_ASIN
items['product_name'] = product_name
items['product_price'] = product_price
items['product_score'] = product_score
yield items
Crawling looks like this:
https://i.stack.imgur.com/UbVUt.png
I'm getting a 200 returned so I know I'm getting the data from the webpage, but sometimes it does not go into parse_dir_contents, or it only grabs info on a few shirts and then moves on to the next keyword without following pagination.
Working with two keywords: the first keyword in my file (keywords.txt) is loaded, it may find 1-3 shirts, then it moves on to the next keyword. The second keyword is then completely successful, finding all shirts and following pagination. In a keyword file with 5+ keywords, the first 2-3 keywords are skipped, then the next keyword is loaded and only 2-3 shirts are found before it moves onto the next word which is again completely successful. In a file with 10+ keywords I get very sporadic behavior.
I have no idea why this is happening can anyone explain?