I am using Scrapy's xml feed spider sitemap to crawl and extract urls and only urls.
The xml sitemap looks like this:
<url>
<loc>
https://www.example.com/american-muscle-5-pc-kit-box.html
</loc>
<lastmod>2020-10-14T15:40:02+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
<image:image>
<image:loc>
https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg
</image:loc>
<image:title>
5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE
</image:title>
</image:image>
<PageMap>
<DataObject type="thumbnail">
<Attribute name="name" value="5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE"/>
<Attribute name="src" value="https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg"/>
</DataObject>
</PageMap>
</url>
I ONLY want to get the contents of the <loc></loc>
So I set my scrapy spider up like this (some parts omitted for brevity):
start_urls = ['https://www.example.com/sitemap.xml']
namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
itertag = 'url'
def parse_node(self, response, selector):
item = {}
item['url'] = selector.select('url').get()
selector.remove_namespaces()
yield {
'url': selector.xpath('//loc/text()').getall()
}
That ends up givin me the url and url for all the product images. How can I set this spider up to ONLY get the actual product page url?