I am trying to scrape info on shirts from Amazon. My spider currently accepts a list of keywords and uses them to perform a search on Amazon. For each search results page I call the parse function. I want to grab each of the resulting items and inspect them further using Scrapy's "response.follow(...)" method.
I am currently trying to do this using "response.css('.s-result-item')" to get all the results. I have also tried using "response.css('.sg-col-inner')". Either way, it gets some of the results but not all of them, and sometimes it only gets two or three per page. If I add .extract() to the statement it completely fails. Here is my parse method:
    def parse(self, response):
        print("========== starting parse ===========")
        print(response.text)
        all_containers = response.css(".s-result-item")
        for shirts in all_containers:
            next_page = shirts.css('.a-link-normal::attr(href)').extract_first()
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                yield response.follow('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)
        second_page = response.css('li.a-last a::attr(href)').get()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield response.follow('http://api.scraperapi.com/?api_key=mykey&url=' + second_page, callback=self.parse)
        else:
            AmazonSpiderSpider.current_keyword = AmazonSpiderSpider.current_keyword + 1
I am new to Python and Scrapy, and I do not know whether I should be using response.follow or scrapy.Request, or whether that would even make a difference. Any ideas?
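
In case it is relevant: in the method above I paste each link straight onto the end of the proxy URL, unescaped, and the next-page href may be relative. Below is a minimal sketch of a helper I could use instead, built only on the standard library. The helper name proxied_url and the two constants are hypothetical, "mykey" is the same placeholder as in my code, and I have not confirmed that this has anything to do with the missing results.

```python
from urllib.parse import urlencode, urljoin

AMAZON_BASE = "https://www.amazon.com"         # site being scraped (from my spider)
PROXY_ENDPOINT = "http://api.scraperapi.com/"  # proxy endpoint used in my spider


def proxied_url(href, api_key="mykey", base=AMAZON_BASE):
    """Wrap an Amazon link in the proxy URL (hypothetical helper)."""
    # urljoin returns absolute URLs unchanged and resolves relative ones against base.
    target = urljoin(base, href)
    # urlencode percent-escapes the target, so its own ?, & and = stay inside url=...
    return PROXY_ENDPOINT + "?" + urlencode({"api_key": api_key, "url": target})


if __name__ == "__main__":
    print(proxied_url("/s?k=shirts&page=2"))                 # relative href
    print(proxied_url("https://www.amazon.com/dp/EXAMPLE"))  # already absolute
```

With something like this, both yields in parse could pass proxied_url(next_page) or proxied_url(second_page). Since the resulting URL is already absolute, I assume response.follow and scrapy.Request would behave the same for it.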