I am using Scrapy's XMLFeedSpider to crawl an XML sitemap and extract URLs, and only URLs.

Each entry in the XML sitemap looks like this:

<url>
<loc>
https://www.example.com/american-muscle-5-pc-kit-box.html
</loc>
<lastmod>2020-10-14T15:40:02+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
<image:image>
<image:loc>
https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg
</image:loc>
<image:title>
5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE
</image:title>
</image:image>
<PageMap>
<DataObject type="thumbnail">
<Attribute name="name" value="5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE"/>
<Attribute name="src" value="https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg"/>
</DataObject>
</PageMap>
</url>

I ONLY want to get the contents of the <loc></loc> element.

So I set my Scrapy spider up like this (some parts omitted for brevity):

    start_urls = ['https://www.example.com/sitemap.xml']
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'url'

    def parse_node(self, response, selector):
        item = {}
        item['url'] = selector.select('url').get()
        selector.remove_namespaces()
        yield {
            'url': selector.xpath('//loc/text()').getall()
        }

That ends up giving me the page URL as well as the URLs of all the product images. How can I set this spider up to get ONLY the actual product page URL?

user3125823

1 Answer

To change this part of the sitemap spider's logic, you need to override its _parse_sitemap method (source)
and replace this section:

    elif s.type == 'urlset':
        for loc in iterloc(it, self.sitemap_alternate_links):
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break

with something like this:

    elif s.type == 'urlset':
        for entry in it:
            item = entry  # entry: a sitemap entry parsed into a dictionary by the sitemap spider
            ...
            yield item  # return the item instead of making a request

With this change, the spider should return items built from the parsed sitemap entries instead of making a request for every link.
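
The underlying issue in the question is that '//loc' matches at any depth, so once namespaces are removed it also picks up the nested <image:loc>. As a rough illustration only (standard library, not Scrapy's own code), the sketch below reads just the <loc> that is a direct child of each <url>. The helper name page_urls, the SAMPLE document, and the image namespace URI are assumptions made for this example; only the sitemap namespace comes from the question.

```python
import xml.etree.ElementTree as ET

# Sitemap namespace as given in the question's `namespaces` setting.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def page_urls(xml_text):
    """Return the text of each <loc> that is a direct child of a <url> entry."""
    root = ET.fromstring(xml_text)
    urls = []
    # findall/find with a plain tag name only look at direct children,
    # so the nested <image:loc> is never considered.
    for url_node in root.findall(f"{{{SITEMAP_NS}}}url"):
        loc = url_node.find(f"{{{SITEMAP_NS}}}loc")
        if loc is not None and loc.text:
            urls.append(loc.text.strip())
    return urls


# Hypothetical, shortened sample modelled on the entry in the question.
# The image namespace URI is an assumption; it is not shown in the question.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>
      https://www.example.com/american-muscle-5-pc-kit-box.html
    </loc>
    <image:image>
      <image:loc>https://www.example.com/pub/media/1202_45.jpg</image:loc>
    </image:image>
  </url>
</urlset>"""

if __name__ == "__main__":
    print(page_urls(SAMPLE))
```

The same idea carries over to whichever parser is used: select <loc> relative to the current <url> entry rather than with a document-wide '//' search.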

Georgiy