I am trying to scrape information on shirts from Amazon. My spider currently accepts a list of keywords and uses them to perform a search on Amazon. For each search page I call the parse function. I want to grab each of the resulting items and inspect them further using Scrapy's "response.follow(...)" method.

I am currently trying to do this using "response.css('.s-result-item')" to get all the results. I have also tried using "response.css('.sg-col-inner')". Either way, it gets some of the results but not all of them, and sometimes it only gets two or three per page. If I add .extract() to the statement, it fails completely. Here is my parse method:

def parse(self, response):
    print("========== starting parse ===========")
    print(response.text)
    # Each search result is expected to sit in an element with class "s-result-item".
    all_containers = response.css(".s-result-item")
    for shirt in all_containers:
        # First product link inside this result container, if any.
        next_page = shirt.css('.a-link-normal::attr(href)').extract_first()
        if next_page is not None:
            # Relative links need the Amazon domain before being passed to the proxy.
            if "https://www.amazon.com" not in next_page:
                next_page = "https://www.amazon.com" + next_page
            yield response.follow('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)

    # Link to the next page of search results.
    second_page = response.css('li.a-last a::attr(href)').get()
    if second_page is not None and AmazonSpiderSpider.page_number < 3:
        AmazonSpiderSpider.page_number += 1
        # The pagination link can be relative too, so apply the same domain check.
        if "https://www.amazon.com" not in second_page:
            second_page = "https://www.amazon.com" + second_page
        yield response.follow('http://api.scraperapi.com/?api_key=mykey&url=' + second_page, callback=self.parse)
    else:
        AmazonSpiderSpider.current_keyword += 1

I am new to Python and Scrapy, and I do not know whether I should be using response.follow or scrapy.Request, or whether that would even make a difference. Any ideas?

nyedidikeke

1 Answer

I have accomplished this using:

for next_page in response.css("h2.a-size-mini a").xpath("@href").extract():
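Each href returned by that loop still has to be turned into a request through the proxy. The sketch below is one possible way to build that URL, not the code I actually ran: the helper name proxied_url is hypothetical, and the endpoint and its api_key and url parameters are simply the ones that appear in the question. It makes relative links absolute and percent-encodes the target so that its own "?" and "&" are not read as proxy parameters.

```python
from urllib.parse import urlencode, urljoin

AMAZON_BASE = "https://www.amazon.com"         # site being scraped (from the question)
PROXY_ENDPOINT = "http://api.scraperapi.com/"  # proxy endpoint as used in the question


def proxied_url(href, api_key):
    """Hypothetical helper: wrap an Amazon link in a proxy request URL."""
    # Relative links such as "/dp/B000" become absolute;
    # links that are already absolute pass through unchanged.
    target = urljoin(AMAZON_BASE, href)
    # urlencode percent-escapes the target, so its own "?" and "&"
    # stay inside the single "url" parameter.
    return PROXY_ENDPOINT + "?" + urlencode({"api_key": api_key, "url": target})
```

Inside the loop, the body could then be a single line such as yield scrapy.Request(proxied_url(next_page, "mykey"), callback=self.parse_dir_contents). Because the URL is already absolute, response.follow and scrapy.Request behave the same here; response.follow only differs in that it resolves relative URLs against response.url, which in this setup is the proxy's address rather than Amazon's.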

  • please look at my other question: https://stackoverflow.com/questions/57760431/why-is-scrapy-skipping-some-urls-but-not-others – Conrad Dubois Sep 04 '19 at 19:40