I want to scrape a WordPress site with Scrapy. I need the heading, text, date and author of each post. The author is not shown on the full article page, and the full text is not in the short version on the listing page. So I have to copy the author first, then visit the full version of the post to get the text. I can't figure out how to write data from two URLs to the same CSV line.

So I want to visit https://www.exemple.me/news/page/1/ and copy the author --> go to the first post and copy the heading, date and text --> store the data in a CSV (author, heading, date, text) --> go back to https://www.exemple.me/news/page/1/ and do the same thing with the second post, and so on.

I know how to use selectors; my problem is that I can't store data from two URLs on the same line.

I can do it with Selenium and BeautifulSoup, but I want to learn how to do it in Scrapy too.

Nisse Karlsson

1 Answer

You can use cb_kwargs to pass the author from the listing-page callback to the article callback:

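import scrapy


class WordpressSpider(scrapy.Spider):
    name = "wp"
    start_urls = ['https://www.wordpresssite.com']

    # The XPath expressions below are placeholders; replace them with the
    # site's real selectors.

    def parse(self, response):
        # Listing page: the author is only available here, so read it per article.
        for article in response.xpath('//article/selector'):
            author = article.xpath('./author/selector').get()
            article_url = article.xpath('./article/url/selector').get()
            # response.follow also accepts relative URLs and resolves them
            # against the current page.
            yield response.follow(
                article_url,
                callback=self.parse_article,
                # Entries in cb_kwargs are passed to the callback as keyword arguments.
                cb_kwargs={
                    'author': author,
                },
            )

    def parse_article(self, response, author):
        # Full post: scrape the remaining fields and emit everything as one item.
        title = response.xpath('//title/selector').get()
        date = response.xpath('//date/selector').get()
        text = response.xpath('//text/selector').get()
        yield {
            'author': author,
            'title': title,
            'date': date,
            'text': text,
        }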
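
Because parse_article yields a single dict that already contains the author from the listing page, Scrapy's feed exports can write it as one CSV line. The following is only a minimal sketch of the settings, assuming Scrapy 2.4 or newer (for the `overwrite` option) and a hypothetical output file name `articles.csv`; the field names are the keys yielded by parse_article above. It would go in the project's `settings.py`:

```python
# Feed export settings (sketch). "articles.csv" is a hypothetical file name.
FEEDS = {
    "articles.csv": {
        "format": "csv",
        # Column order of the CSV; names must match the keys of the yielded dict.
        "fields": ["author", "title", "date", "text"],
        "encoding": "utf8",
        # Replace the file on each run instead of appending to it.
        "overwrite": True,
    },
}
```

Without any settings, running `scrapy crawl wp -o articles.csv` should also produce a CSV with one line per yielded item; the explicit `fields` list is only there to fix the column order.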
gangabass