
I'm scraping a news site. Every news article has content and many comments, so I have two Items: one for the content and one for the comments. The problem is that the content and the comments are yielded from different requests. I want an article's content and its comments to be yielded (or returned) together, as one item. The pipeline timing or order doesn't matter to me.

In my items file:

import scrapy


class NewsPageItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    hour = scrapy.Field()
    image = scrapy.Field()
    image_url = scrapy.Field()
    top_content = scrapy.Field()
    parag = scrapy.Field()
    #comments = scrapy.Field()
    comments_count = scrapy.Field()

class CommentsItem(scrapy.Item):
    id_ = scrapy.Field()
    username = scrapy.Field()
    firstname = scrapy.Field()
    lastname = scrapy.Field()
    email = scrapy.Field()
    ip = scrapy.Field()
    userid = scrapy.Field()
    date = scrapy.Field()
    comment_text = scrapy.Field()
    comment_type_id = scrapy.Field()
    object_id = scrapy.Field()
    yes = scrapy.Field()
    no = scrapy.Field()

In the spider, the news content and its comments are not connected:

class NewsSpider(scrapy.Spider):
    ...

    def parse(self, response):
        for nl in news_links:
            yield scrapy.Request(url=nl, callback=self.new_parse)
            yield scrapy.Request(url=url, callback=self.comment_parse)

    def new_parse(self, response):
        item = NewsPageItem()
        item['title'] = response.xpath(...).extract()
        ...
        yield item

    def comment_parse(self, response):
        data = json.loads(response.body.decode('utf8'))

        for comment in data.get('data', []):
            item = CommentsItem()
            item['id_'] = comment.get('Id')
            ...
            yield item

Pipelines:

class NewsPagePipeline(object):
    def process_item(self, item, spider):
        return item

class CommentsPipeline(object):
    def process_item(self, item, spider):
        return item

How can I connect the items, or nest one inside the other, when yielding?


1 Answer


It's better to chain the requests and pass the news item between callbacks via meta, populating it with the comments along the way:

class NewsSpider(scrapy.Spider):
    ...

    def parse(self, response):
        for nl in news_links:
            # carry the comments url along so the next callback can request it
            yield scrapy.Request(url=nl, callback=self.new_parse,
                                 meta={'comments_url': url})

    def new_parse(self, response):
        item = NewsPageItem()
        item['title'] = response.xpath(...).extract()
        item['comments'] = []
        ...
        # instead of yielding the item here, request the comments
        # and pass the partially filled item along with the request
        yield scrapy.Request(response.meta['comments_url'],
                             callback=self.comment_parse,
                             meta={'item': item})

    def comment_parse(self, response):
        data = json.loads(response.body.decode('utf8'))
        item = response.meta['item']
        for comment in data.get('data', []):
            c_item = CommentsItem()
            c_item['id_'] = comment.get('Id')
            ...
            item['comments'].append(c_item)
        # the news item now carries its comments, so yield it as one piece
        yield item
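
For this to work, the comments field has to actually exist on the news item, so the commented-out line in your items file needs to be re-enabled:

class NewsPageItem(scrapy.Item):
    ...
    comments = scrapy.Field()

Side note (not part of the original code): on Scrapy 1.7 and newer, the same hand-off between callbacks can be done with cb_kwargs instead of meta, which keeps your own values separate from the keys Scrapy and its middlewares store in meta. A minimal sketch, assuming the same spider class, callbacks and selectors as above:

    def parse(self, response):
        for nl in news_links:
            yield scrapy.Request(url=nl, callback=self.new_parse,
                                 cb_kwargs={'comments_url': url})

    def new_parse(self, response, comments_url):
        item = NewsPageItem()
        item['title'] = response.xpath(...).extract()
        item['comments'] = []
        ...
        yield scrapy.Request(comments_url, callback=self.comment_parse,
                             cb_kwargs={'item': item})

    def comment_parse(self, response, item):
        data = json.loads(response.body.decode('utf8'))
        for comment in data.get('data', []):
            c_item = CommentsItem()
            c_item['id_'] = comment.get('Id')
            ...
            item['comments'].append(c_item)
        yield item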