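import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Load the banned ASINs, one per line.
with open('/home/timmy/myamazon/bannedasins.txt') as f:
    banned_asins = f.read().splitlines()

class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    rules = (
        # Follow the "next page" link of the search results.
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        # Extract product links and rewrite them to https://www.amazon.com/dp/{ASIN}.
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                           process_value=lambda i: f"https://www.amazon.com/dp/{re.search('dp/(.*)/', i).groups()[0]}"),
             callback="parse_item"),
    )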
The two rules above extract Amazon product links, and they work correctly. Now I want to exclude some ASINs from the results.
The expression re.search('dp/(.*)/',i).groups()[0] retrieves the ASIN, which is then placed into the format https://www.amazon.com/dp/{ASIN}.
What I want is: if the ASIN is in banned_asins, do not extract the link.
After reading the Link Extractors page of the Scrapy docs, I believe this is done with deny_extensions, but I am not sure how to use it.
banned_asins= ['B07RTX74L7','B07D9JCH5X',......]
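
One possible approach, sketched under some assumptions: the Scrapy Link Extractors documentation says that process_value may return None to ignore a link altogether, so the banned check could live in the same function that rewrites the URL. deny_extensions, by contrast, filters on file extensions such as .pdf, so it is probably not the right tool here. In the sketch below, the name product_url, the two-item sample set, and the stricter pattern (an ASIN taken to be the 10 uppercase alphanumeric characters after /dp/) are illustrative assumptions, not part of the original spider.

```python
import re

# Hypothetical sample; in the spider this would be built from bannedasins.txt.
BANNED_ASINS = frozenset({"B07RTX74L7", "B07D9JCH5X"})

# Assumption: an ASIN is the 10-character uppercase alphanumeric token after "/dp/".
ASIN_RE = re.compile(r"/dp/([A-Z0-9]{10})")


def product_url(value, banned=BANNED_ASINS):
    """Return https://www.amazon.com/dp/{ASIN}, or None to drop the link."""
    match = ASIN_RE.search(value)
    if match is None:
        return None  # no ASIN found, so this is not treated as a product link
    asin = match.group(1)
    if asin in banned:
        return None  # banned ASIN: returning None tells the extractor to skip it
    return f"https://www.amazon.com/dp/{asin}"
```

In the second rule this function would take the place of the lambda, i.e. process_value=product_url. Using a set rather than a list for the banned ASINs keeps each membership check cheap when the file is large.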