
How can I parse data for one variable directly from the start URL, and data for the other variables after following each href from the start URL? The page I want to scrape lists articles with "category", "title", "content", "author", and "date" data. To scrape it, I follow every "href" on the start URL to the full article and parse the data there. However, the "category" data is not always present when an individual article is opened from its "href", so some observations end up with missing values. The start URL itself shows the "category" for every article listing (no missing data), so now I want to scrape just the "category" directly from the start URL. How should I parse the "category" data, and how should I handle the parsing and callbacks? The "category" element is circled in red in the screenshot below.

import scrapy
# CoindeskItem comes from the project's items module, e.g. from ..items import CoindeskItem

class BtcNewsletterSpider(scrapy.Spider):
    name = 'btc_spider'
    allowed_domains = ['www.coindesk.com']
    start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']

    def parse(self, response):
        for link in response.css('.card-title'):
            yield response.follow(link, callback=self.parse_newsletter)

    def parse_newsletter(self, response):
        item = CoindeskItem()
        item['category'] = response.css('.kjyoaM::text').get()
        item['headline'] = response.css('.fPbJUO::text').get()
        item['content_summary'] = response.css('.jPQVef::text').get()
        item['authors'] = response.css('.dWocII a::text').getall()
        item['published_date'] = response.css(
            '.label-with-icon .fUOSEs::text').get()
        yield item

[screenshot: article listing on the start URL with the category element circled in red]

hareko

1 Answer


You can use the `cb_kwargs` argument to pass data from one parse callback to another. To do this, grab the category value alongside the link to the full article: iterate over an element that contains both the category and the link, and extract each of them from that element.

Here is an example based on the code you provided; it should work the way you described.

import scrapy
# CoindeskItem comes from the project's items module, e.g. from ..items import CoindeskItem

class BtcNewsletterSpider(scrapy.Spider):
    name = 'btc_spider'
    allowed_domains = ['www.coindesk.com']
    start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']

    def parse(self, response):
        for card in response.xpath("//div[contains(@class,'articleTextSection')]"):
            item = CoindeskItem()
            item["category"] = card.xpath(".//a[@class='category']//text()").get()
            link = card.xpath(".//a[@class='card-title']/@href").get()
            yield response.follow(
                link,
                callback=self.parse_newsletter,
                cb_kwargs={"item": item}
            )

    def parse_newsletter(self, response, item):
        item['headline'] = response.css('.fPbJUO::text').get()
        item['content_summary'] = response.css('.jPQVef::text').get()
        item['authors'] = response.css('.dWocII a::text').getall()
        item['published_date'] = response.css(
            '.label-with-icon .fUOSEs::text').get()
        yield item
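If some article pages do carry the category and you would rather use that value when it exists, you can keep the article-page lookup and treat the listing value passed through `cb_kwargs` as a fallback. A minimal sketch of that fallback logic, assuming the question's `.kjyoaM` selector still matches the article-page category:

```python
def pick_category(article_value, listing_value):
    """Prefer the category scraped on the article page; fall back to the
    value carried over from the listing page when the article lacks one."""
    return article_value if article_value else listing_value


# Inside parse_newsletter, with `item` arriving via cb_kwargs:
#     article_cat = response.css('.kjyoaM::text').get()
#     item['category'] = pick_category(article_cat, item['category'])
```

This way the item's category is never None as long as the listing page provides one.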
Alexander