import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Load the banned ASINs, one per line; splitlines() avoids a trailing '' entry
with open('/home/timmy/myamazon/bannedasins.txt') as f:
    banned_asins = f.read().splitlines()

class AmazonSpider(CrawlSpider):

    name = 'amazon'
    allowed_domains = ['amazon.com']

    rules = (
        # Follow the pagination ("next page") links
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        # Extract product links and rewrite them to https://www.amazon.com/dp/{ASIN}
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
            process_value=lambda i: f"https://www.amazon.com/dp/{re.search('dp/(.*)/', i).groups()[0]}"),
            callback="parse_item"),
    )

I have the above two rules to extract Amazon product links, and they work correctly. Now I want to exclude some ASINs from the crawl. `re.search('dp/(.*)/', i).groups()[0]` retrieves the ASIN and places it in the format `https://www.amazon.com/dp/{ASIN}`; what I want to do is: if the ASIN is in `banned_asins`, do not extract the link.

After reading the Link Extractors page of the Scrapy docs, I believe it's done with `deny_extensions`, but I'm not sure how to use it.

banned_asins = ['B07RTX74L7', 'B07D9JCH5X', ......]
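
For reference, a quick sketch of what that regex captures on its own (the URL below is made up):

import re

# hypothetical product link as it appears on a results page
url = "https://www.amazon.com/Some-Product/dp/B07RTX74L7/ref=sr_1_1"

# greedy (.*) captures everything between 'dp/' and the last '/',
# which on this URL is exactly the ASIN
print(re.search('dp/(.*)/', url).groups()[0])  # B07RTX74L7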
programmerwiz32
  • you can make a new list of links from the ASINs you have and then ban the new links – wishmaster Jul 21 '19 at 23:31
  • Rule(LinkExtractor(deny_extensions= banned_asins) ? That seems like the obvious answer, have you tried that? If so what was the result? – pjmaracs Jul 26 '19 at 12:32
  • @pjmaracs as you can see, I don't get the ASIN directly; I have to use regex to find it, which takes priority over deny_extensions – programmerwiz32 Jul 26 '19 at 15:48
  • @programmerwiz32 thanks for the clarification. Where in the docs do you see that process_value takes priority? Does it also take priority over deny? – pjmaracs Jul 26 '19 at 18:27

1 Answer

deny_extensions won't work: it refers to common file extensions that are not followed if they occur in links; see scrapy.linkextractors.IGNORED_EXTENSIONS for the default values used when it's not given.
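
To illustrate (a minimal sketch): deny_extensions filters links by file extension, so a list of ASINs passed to it would simply never match anything:

from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS

# the default list covers common binary/media formats such as images and archives
print(len(IGNORED_EXTENSIONS), IGNORED_EXTENSIONS[:5])

# this extractor skips links ending in .pdf or .zip; an ASIN like
# 'B07RTX74L7' is never a file extension, so it would never be denied here
extractor = LinkExtractor(deny_extensions=['pdf', 'zip'])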

You just filter out the banned ASINs in your process_value function. If it returns None, the given link is ignored:

process_value (callable)

a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.

So it should be:

def process_value(i):
    match = re.search('dp/([^/]+)', i)  # [^/]+ stops at the next '/', capturing only the ASIN
    if match is None:
        return None  # not a product link
    asin = match.group(1)
    return f"https://www.amazon.com/dp/{asin}" if asin not in banned_asins else None

....

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
            process_value=process_value), callback="parse_item"),
        )
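
You can sanity-check the filter on its own, outside the spider (the URLs below are made up):

banned_asins = ['B07RTX74L7', 'B07D9JCH5X']

print(process_value("https://www.amazon.com/Foo/dp/B07RTX74L7/ref=sr_1_1"))
# None -- banned ASIN, link is dropped

print(process_value("https://www.amazon.com/Bar/dp/B01ABCDEFG/ref=sr_1_2"))
# https://www.amazon.com/dp/B01ABCDEFG -- unlisted ASIN, link is rewritten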
Stef
  • what do you mean by 'no longer works?' Do you get errors? Maybe the structure of the amazon.com site has changed and you'll have to adapt your `restrict_xpaths`. In order to understand what's going on you can place a `print(i)` in your `process_value` function. – Stef Aug 03 '19 at 17:42
  • no, I believe "process_value" no longer works, it must be a lambda function – programmerwiz32 Aug 04 '19 at 17:36
  • no, this is impossible: a named function is exactly equivalent to a lambda function, it makes absolutely no difference whether the function is named or a lambda. Did you test it with print(i) inside the process_value function to see if it is called and what arguments are passed to it? – Stef Aug 04 '19 at 18:17
  • yes, please try it, I was shocked when it just skipped it; even if you comment out the function and still refer to it in process_value, it doesn't matter (no error happens) – programmerwiz32 Aug 05 '19 at 18:26
  • you can try it in the scrapy shell, of course; that's where I am trying it, and see the extracted_links – programmerwiz32 Aug 06 '19 at 14:48
  • you're right: you must place the process_value function *before* the rules (outside your spider class) and remove the quotation marks, i.e. `process_value=process_value`, I updated my answer – Stef Aug 07 '19 at 06:36
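
In other words, the working layout the last comment describes looks like this (a sketch assuming the imports and banned list shown earlier):

import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

banned_asins = ['B07RTX74L7', 'B07D9JCH5X']

def process_value(i):  # module level, defined before the spider class that uses it
    match = re.search('dp/([^/]+)', i)
    if match is None or match.group(1) in banned_asins:
        return None
    return f"https://www.amazon.com/dp/{match.group(1)}"

class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                           process_value=process_value),  # the function object, not the string
             callback="parse_item"),
    )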