I'm currently building a Python backend that deploys a list of scraped items to an S3 bucket in the format I need, using the Scrapinghub service and the Scrapy module.
I successfully log into the website, loop through its pages, and yield items on each page. Here is how I yield the items:
spider.py
def parse(self, response):
    # Collect the links and display names of all items on the current page
    links = response.selector.xpath('//a[@class="item"]/@href').getall()
    names = response.selector.xpath('//a[@class="item"]/text()').getall()
    yield {'links': links, 'names': names}
In pipelines.py I wrote a custom JSON pipeline so that the resulting JSON file has the following format:
{
    "list_of_objects": [
        {
            "links": "link",
            "names": "name"
        },
        {...},
        {...},
        ...
    ]
}
pipelines.py
from collections import defaultdict

from scrapy.exporters import JsonItemExporter


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Collect items per spider run (instance attribute, not class attribute)
        self.items_list = []
        self.file = open('items.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.items_list.append(item)
        return item

    def close_spider(self, spider):
        # Wrap all collected items in a single top-level object before exporting
        def_dict = defaultdict(list)
        def_dict['list_of_objects'] = self.items_list
        self.exporter.export_item(def_dict)
        self.exporter.finish_exporting()
        self.file.close()
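The pipeline is enabled through the ITEM_PIPELINES setting; the dotted path below is an assumption based on my project layout:

settings.py
# Enable the custom JSON pipeline
# (the 'myproject' package name is a placeholder for my actual project)
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}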
First I ran the spider locally and got exactly the format I was expecting. Then I deployed the spider to Scrapinghub and configured the project to upload the output to an AWS S3 bucket.
I was able to run the spider on Scrapinghub and then looked at the JSON output in the S3 bucket. I found the file, but its format was not the one I was expecting:
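For reference, the S3 upload is configured through Scrapy's feed export settings, roughly as sketched below (bucket name, path, and credentials are placeholders, not my real values):

settings.py
# Sketch of the feed export configuration used for the S3 upload
FEED_FORMAT = 'json'
FEED_URI = 's3://my-bucket/items/%(name)s-%(time)s.json'
AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'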
[
    {
        "links": "link",
        "names": "name"
    },
    {...},
    {...},
    ...
]
Do you have any suggestions as to why the file on the S3 bucket wasn't in the format I was expecting?