
I'm currently building a Python backend that should deliver a list of items to an S3 bucket, in the format I need, using the Scrapinghub service and the Scrapy module.

I successfully log into the website, loop through the pages, and yield items on each page. Here is how I yield the items:

spider.py

def parse(self, response):
    # Collect the href and display text of every item link on the page
    links = response.xpath('//a[@class="item"]/@href').getall()
    names = response.xpath('//a[@class="item"]/text()').getall()

    yield {'links': links, 'names': names}

In pipelines.py I made a custom JSON pipeline to get a JSON file in the following format:

{
    "list_of_objects": [
            {
                "links": "link",
                "names": "name"
            },
            {...},
            {...},
            ...
    ]
}

pipelines.py

from collections import defaultdict

from scrapy.exporters import JsonItemExporter


class JsonWriterPipeline(object):
    items_list = []

    def open_spider(self, spider):
        # Write to a local file through Scrapy's JSON exporter
        self.file = open('items.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        # Only collect the items here; the wrapped object is exported in close_spider
        self.items_list.append(item)
        return item

    def close_spider(self, spider):
        # Wrap all collected items under a single "list_of_objects" key
        def_dict = defaultdict(list)
        def_dict['list_of_objects'] = self.items_list

        self.exporter.export_item(def_dict)
        self.exporter.finish_exporting()
        self.file.close()

First of all, I ran this spider locally and got exactly the format I was expecting. Then I deployed the spider to Scrapinghub and configured the project to upload the output to an AWS S3 bucket.
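
For reference, that S3 configuration boils down to Scrapy feed-export settings roughly like the following (the bucket path and credentials shown here are placeholders, not my real values):

settings.py

# Feed exports: write the scraped items as JSON to S3
# (pre-Scrapy-2.1 FEED_* style settings; bucket path and keys are placeholders)
FEED_FORMAT = 'json'
FEED_URI = 's3://my-bucket/scrapy/%(name)s_%(time)s.json'

# Credentials used by the S3 feed storage backend
AWS_ACCESS_KEY_ID = '<access key>'
AWS_SECRET_ACCESS_KEY = '<secret key>'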

I was able to run the spider on Scrapinghub and then looked at the output JSON in the AWS S3 bucket. I found the file there, but the format wasn't the one I was looking for:

[
    {
         "links": "link",
         "names": "name"
    },
    {...},
    {...},
    ...
]

Do you have any suggestions as to why the file in the S3 bucket wasn't in the format I was expecting?

IK KLX

  • Is `ddict` the same as `def_dict`? – JQadrad May 28 '20 at 18:02
  • @JQadrad Yeah, they are the same. I edited the question. – IK KLX May 28 '20 at 18:04
  • Your pipeline writes into a local file. Did you modify the code to upload that file to S3? – Gallaecio Jun 01 '20 at 10:52
  • 1
    I think what you may be looking for, instead of writing a pipeline, is writing a custom item exporter (e.g. subclassing [JsonItemExporter](https://docs.scrapy.org/en/latest/topics/exporters.html#jsonitemexporter)) and configuring it in [FEED_EXPORTERS](https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEED_EXPORTERS). – Gallaecio Jun 01 '20 at 10:55
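
For illustration, a minimal sketch of what that last comment suggests could look like this (the exporters.py module, the WrappedJsonItemExporter name and the myproject package path are hypothetical; only the opening and closing writes of the stock JsonItemExporter are overridden):

exporters.py

from scrapy.exporters import JsonItemExporter


class WrappedJsonItemExporter(JsonItemExporter):
    """Writes {"list_of_objects": [item, item, ...]} instead of a bare JSON array."""

    def start_exporting(self):
        # Open the wrapping object before the item array
        self.file.write(b'{"list_of_objects": [')

    def finish_exporting(self):
        # Close the item array and the wrapping object
        self.file.write(b']}')

settings.py

# Register the custom exporter for the 'json' feed format used by the feed export
FEED_EXPORTERS = {
    'json': 'myproject.exporters.WrappedJsonItemExporter',
}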

0 Answers