How can I parse data for one variable directly from the start URL, and data for the other variables after following each href from the start URL?

The web page I want to scrape has a list of articles with "category", "title", "content", "author", and "date" data. To scrape it, I followed every "href" on the start URL (each redirects to the full article) and parsed the fields there. However, the "category" data is not always present when an individual article is opened from its "href", so some observations end up with missing values. The start URL, by contrast, shows the "category" for every article listing (no missing data), so now I want to scrape just the "category" directly from the start URL.

How should I go about parsing the "category" data, and how should I handle the parsing and callbacks? The "category" data is circled in red in the image.
import scrapy

from ..items import CoindeskItem  # assumes the item is defined in the project's items.py


class BtcNewsletterSpider(scrapy.Spider):
    name = 'btc_spider'
    allowed_domains = ['www.coindesk.com']
    start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']

    def parse(self, response):
        # Follow each article link on the listing page to the full article.
        for link in response.css('.card-title'):
            yield response.follow(link, callback=self.parse_newsletter)

    def parse_newsletter(self, response):
        # Parse all fields from the individual article page.
        item = CoindeskItem()
        item['category'] = response.css('.kjyoaM::text').get()
        item['headline'] = response.css('.fPbJUO::text').get()
        item['content_summary'] = response.css('.jPQVef::text').get()
        item['authors'] = response.css('.dWocII a::text').getall()
        item['published_date'] = response.css(
            '.label-with-icon .fUOSEs::text').get()
        yield item
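Here is a sketch of what I think might work, using Scrapy's cb_kwargs to pass the category scraped from the listing page into the article callback and only falling back to the article page when it is missing. The ".card" container and ".category::text" selectors are placeholders; I have not confirmed the real class names on the listing page:

import scrapy

from ..items import CoindeskItem  # same assumption as above about items.py


class BtcNewsletterSpider(scrapy.Spider):
    name = 'btc_spider'
    allowed_domains = ['www.coindesk.com']
    start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']

    def parse(self, response):
        # Placeholder selector: one container element per article listing.
        for card in response.css('.card'):
            # Placeholder selector: the category text shown on the listing page.
            category = card.css('.category::text').get()
            href = card.css('.card-title::attr(href)').get()
            if href:
                # Pass the listing-page category to the article callback.
                yield response.follow(
                    href,
                    callback=self.parse_newsletter,
                    cb_kwargs={'category': category},
                )

    def parse_newsletter(self, response, category=None):
        item = CoindeskItem()
        # Prefer the category from the listing page; fall back to the article page.
        item['category'] = category or response.css('.kjyoaM::text').get()
        item['headline'] = response.css('.fPbJUO::text').get()
        item['content_summary'] = response.css('.jPQVef::text').get()
        item['authors'] = response.css('.dWocII a::text').getall()
        item['published_date'] = response.css(
            '.label-with-icon .fUOSEs::text').get()
        yield item

Is passing the value through cb_kwargs like this the right way to combine data from the listing page and the article page, or is there a better pattern for this?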