
I am writing a Scrapy crawler to grab info on shirts from Amazon. The crawler starts on an Amazon results page for some search, "funny shirts" for example, and collects all the result item containers. It then parses through each result item, collecting data on the shirts.

I use ScraperAPI and scrapy-user-agents to dodge Amazon's bot detection; my middleware setup is sketched below.
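For reference, this is roughly how scrapy-user-agents is wired into my settings.py (a sketch, following that package's documented middleware path):

DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's built-in user agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # rotate a random user agent onto every request
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

The code for my spider is: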

import re

import scrapy

# item class defined in the project's items.py (import path assumed)
from ..items import ScrapeAmazonItem


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    page_number = 2

    # one search keyword per line in keywords.txt
    keyword_file = open("keywords.txt", "r+")
    all_key_words = keyword_file.readlines()
    keyword_file.close()
    all_links = []
    keyword_list = []

    # build a ScraperAPI-wrapped Amazon search URL for each keyword
    for keyword in all_key_words:
        keyword_list.append(keyword)
        formatted_keyword = keyword.replace('\n', '')
        formatted_keyword = formatted_keyword.strip()
        formatted_keyword = formatted_keyword.replace(' ', '+')
        all_links.append("http://api.scraperapi.com/?api_key=mykey&url=https://www.amazon.com/s?k=" + formatted_keyword + "&ref=nb_sb_noss_2")

    start_urls = all_links

    def parse(self, response):
        print("========== starting parse ===========")

        # each ".s-result-item" container is one listing on the search results page
        all_containers = response.css(".s-result-item")
        for shirts in all_containers:
            next_page = shirts.css('.a-link-normal::attr(href)').extract_first()
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                # request the product page through ScraperAPI as well
                yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)

        # follow pagination, limited to the first couple of result pages
        second_page = response.css('li.a-last a::attr(href)').get()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield response.follow(second_page, callback=self.parse)

    def parse_dir_contents(self, response):
        items = ScrapeAmazonItem()

        print("============= parsing page ==============")

        # product title
        temp = response.css('#productTitle::text').extract()
        product_name = ''.join(temp)
        product_name = product_name.replace('\n', '')
        product_name = product_name.strip()

        # price
        temp = response.css('#priceblock_ourprice::text').extract()
        product_price = ''.join(temp)
        product_price = product_price.replace('\n', '')
        product_price = product_price.strip()

        # sales rank, digits only
        temp = response.css('#SalesRank::text').extract()
        product_score = ''.join(temp)
        product_score = product_score.strip()
        product_score = re.sub(r'\D', '', product_score)

        # the ASIN is the "B..." token in the product URL
        product_ASIN = re.search(r'(?<=/)B[A-Z0-9]{9}', response.url)
        product_ASIN = product_ASIN.group(0)

        items['product_ASIN'] = product_ASIN
        items['product_name'] = product_name
        items['product_price'] = product_price
        items['product_score'] = product_score

        yield items

Crawling looks like this:

https://i.stack.imgur.com/UbVUt.png

I'm getting a 200 returned, so I know I'm getting the data from the webpage, but sometimes it does not go into parse_dir_contents, or it only grabs info on a few shirts and then moves on to the next keyword without following pagination.

Working with two keywords: the first keyword in my file (keywords.txt) is loaded, it may find 1-3 shirts, and then it moves on to the next keyword. The second keyword is then completely successful, finding all shirts and following pagination. In a keyword file with 5+ keywords, the first 2-3 keywords are skipped, then the next keyword is loaded and only 2-3 shirts are found before it moves on to the next word, which is again completely successful. In a file with 10+ keywords I get very sporadic behavior.

I have no idea why this is happening. Can anyone explain?

  • It's probable Amazon is detecting you're running a crawler and returning bogus data – Francisco Sep 02 '19 at 17:01
  • Thanks for your comment, but how is it doing this when I am running through a proxy and using randomized middleware? – Conrad Dubois Sep 02 '19 at 18:21
  • Did you verify that you are getting the correct data by printing? Maybe add print(response.text) or something to verify you are getting the correct page. I tried ScraperAPI a few months back and sometimes, even with a 200, it seemed to return empty pages with a bunch of JS scripts. – Amit Sep 03 '19 at 08:32
  • I added print(response.text) and it looks like the pages returned contain the actual HTML data. Is there something wrong with my code maybe? Are my loops for following pagination or crawling shirts somehow failing when I have a list of keywords? The search is ALWAYS successful when I only have 1 keyword in the file. One thing I have noticed is that it will print out the response text for multiple pages before it gets to "=========Parsing Page=========". – Conrad Dubois Sep 03 '19 at 15:18
  • I have added print statements to make sure it is creating all the links correctly. I have tried using scrapy.Request and response.follow interchangeably. I assume it has to do with my yields, but I don't know why. – Conrad Dubois Sep 04 '19 at 19:35
  • I have tried adding priority to my next_page loop as is shown here: https://stackoverflow.com/questions/6566322/scrapy-crawl-urls-in-order – Conrad Dubois Sep 04 '19 at 20:16

2 Answers


First, check whether robots.txt is being ignored; from what you have said, I suppose you already have that covered.
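In a Scrapy project that is the ROBOTSTXT_OBEY setting; a minimal sketch of settings.py:

# don't fetch or respect robots.txt
ROBOTSTXT_OBEY = False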

Sometimes the HTML returned in the response is not the same as what you see when you look at the product page. I don't really know what exactly is going on in your case, but you can check what the spider is actually "reading" with:

scrapy shell 'yourURL'

After that, run:

view(response)

There you can check out the code that the spider is actually seeing, if the request succeeds.
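You can also open the same shell from inside the spider while it is running, with the exact response object it received (a sketch using Scrapy's built-in inspect_response):

from scrapy.shell import inspect_response

# inside your parse method: drops you into an interactive shell
# with the exact response the spider received
inspect_response(response, self)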

Sometimes the request does not succeed (maybe Amazon is redirecting you to a CAPTCHA or something).

You can check the response while scraping with something like the code below (please double-check it, I'm doing this from memory):

import requests

# inside your parse method: re-fetch the same URL outside Scrapy
# and print what actually comes back
r = requests.get(response.url)
print(r.content)

If I remember correctly, you can get the URL from Scrapy itself (something along the lines of response.url).
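A variant that skips the extra request entirely is to log what Scrapy already has (a sketch, again inside the parse method):

# status code and a slice of the body Scrapy actually received
self.logger.info("got %s for %s", response.status, response.url)
self.logger.info(response.text[:500])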

Manuel
  • Thank you for your response. I have confirmed with this method that I am getting the correct HTML from Amazon. I have been looking at this page, but it is a little over my head: https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do – Conrad Dubois Sep 06 '19 at 13:49
  • That's a little hard for me too. Another thing that could help is seeing the response you actually get from the Amazon page (not by inserting the URL but the ACTUAL response). I'll update the answer for that. – Manuel Sep 06 '19 at 14:30
  • Thank you, I have checked and I am getting the correct page from Amazon. The URL is exactly as it should be, when it works. The issue is that for some start URLs parse_dir_contents only gets called 2-3 times. For other URLs it will crawl the entire page of results and I get 50-100 shirts. – Conrad Dubois Sep 06 '19 at 15:10
  • I don't really see anything wrong with your code. I had problems while scraping Amazon as you describe, mainly in the first pages I scrape. For example, I only get like 3 products from the first 2 pages but then I get a bunch for the following ones. I haven't found a way to normalize this yet. – Manuel Sep 06 '19 at 18:13
  • Maybe, just maybe, it's about duplicates. I guess you are storing your data somewhere, so it could be the case that shirts you have already scraped are not being scraped again (I think Scrapy has an option for this). Apart from that, I'm out of ideas. – Manuel Sep 06 '19 at 18:15
  • Hmmm, it's very odd that we have similar problems. What doesn't make sense is that if I run the same spider on only a single search it performs perfectly fine. Add a second keyword to the file and suddenly it only returns a few results for the first. – Conrad Dubois Sep 09 '19 at 16:38

Try to make use of dont_filter=True in your Scrapy requests. I had the same problem; it seemed like the Scrapy crawler was ignoring some URLs because it thought they were duplicates.

dont_filter=True 

This makes sure that Scrapy doesn't filter any URLs with its dupefilter.
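Applied to the product-page request from the question, it would look roughly like this (a sketch, reusing the question's ScraperAPI URL prefix):

yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_page,
                     callback=self.parse_dir_contents,
                     dont_filter=True)  # bypass the duplicate-request filter

If you first want to see which requests the dupefilter is dropping, setting DUPEFILTER_DEBUG = True in settings.py makes Scrapy log every filtered request instead of only the first one.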