
I have multiple spiders in one Scrapy project. I am trying to run all of them simultaneously from a script and then dump the results to a JSON file. When I run each spider individually from the shell with -o xyz.json, it works fine.

I've attempted to follow this fairly thorough answer here: How to create custom Scrapy Item Exporter?

However, when I run the script I can see it gathering the data in the shell, but nothing is written to the output file.

Below I've copied, in order: the exporter, the pipeline, and the settings.

Exporter:

from scrapy.exporters import JsonItemExporter

class XYZExport(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super().__init__(file)

    def start_exporting(self):
        self.file.write(b)

    def finish_exporting(self):
        self.file.write(b)

I'm struggling to determine what goes in the self.file.write parentheses.
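For context: because the pipeline opens the file in 'wb' mode, the argument to self.file.write has to be a bytes literal such as b'['. Scrapy's own JsonItemExporter already writes b"[" in start_exporting and b"]" in finish_exporting, so overriding those methods is only needed when the list should be wrapped in something else. The following is a minimal standard-library sketch of that pattern, not Scrapy's implementation; the class name `WrappedJsonExporter` and the "product" key are hypothetical stand-ins chosen for illustration.

```python
import io
import json


class WrappedJsonExporter:
    """Hypothetical stand-in that mimics the start/export/finish pattern."""

    def __init__(self, file):
        self.file = file          # binary file-like object
        self.first_item = True

    def start_exporting(self):
        # Bytes literal: opens the wrapper object and the item list.
        self.file.write(b'{"product": [')

    def export_item(self, item):
        # Separate items with a comma, except before the first one.
        if not self.first_item:
            self.file.write(b",")
        self.first_item = False
        self.file.write(json.dumps(item).encode("utf-8"))

    def finish_exporting(self):
        # Bytes literal: closes the item list and the wrapper object.
        self.file.write(b"]}")


buffer = io.BytesIO()
exporter = WrappedJsonExporter(buffer)
exporter.start_exporting()
exporter.export_item({"name": "a"})
exporter.export_item({"name": "b"})
exporter.finish_exporting()

# The opening and closing bytes must pair up for the output to parse.
result = json.loads(buffer.getvalue())
print(result)
```

If a plain JSON list is all that is needed, the two overrides can probably be dropped so that the inherited methods write the brackets.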

Pipeline:

from exporters import XYZExport

class XYZExport(object):
    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        output_file_name = crawler.settings.get('FILE_NAME')

        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')

        file = open(self.file_name, 'wb')
        self.file_handle = file


        self.exporter = XYZExport(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')

        self.exporter.finish_exporting()

        self.file_handle.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Settings:

FILE_NAME = 'C:\Apps Ive Built\WebScrape Python\XYZ\ScrapeOutput.json'
ITEM_PIPELINES = {
      'XYZ.pipelines.XYZExport' : 600,
}
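A side note on the FILE_NAME string: the backslash sequences in that particular path (\A, \W, \X, \S) are not recognized escapes, so Python keeps them as written, although newer Python versions warn about them. A path containing \t or \n would be silently altered. A small sketch of the difference, using a hypothetical path picked only to trigger the problem:

```python
from pathlib import PureWindowsPath

# Hypothetical path: \t and \n are real escape sequences in a normal string.
plain = 'C:\temp\new.json'    # contains a tab and a newline character
raw = r'C:\temp\new.json'     # raw string: backslashes are kept literally

print(repr(plain))
print(repr(raw))

# The raw string still parses as a Windows path.
print(PureWindowsPath(raw).name)
```

Using a raw string (or forward slashes) for FILE_NAME should make the setting robust to this.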

I hope (and am afraid) it's a simple omission, because that seems to be my MO, but I'm very new to scraping and this is the first time I've tried to do it this way.

If there is a more stable way to export this data, I'm all ears. Otherwise, can you tell me what I've missed that is preventing the data from being exported, or preventing the exporter from being called properly?
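One possibly more stable route, assuming Scrapy 2.1 or later, is the built-in FEEDS setting, which does from settings what -o does on the command line. The sketch below is a settings fragment only; the output/ directory is a hypothetical location. The %(name)s placeholder is replaced with each spider's name, so spiders running at the same time would write to separate files.

```python
# settings.py fragment (assumes Scrapy 2.1+; the output/ folder is hypothetical)
FEEDS = {
    # One JSON file per spider, named after the spider.
    "output/%(name)s.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```

With this in place, a custom exporter and pipeline may not be needed at all, provided the script loads the project settings.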

[Edited to change the pipeline name in settings]

Artie
  • See the code from your link: `self.file.write(b'{\'product\': [')`. The `b` is a prefix that marks a `bytes` literal, as in `b"{'product': ["`; it is not a variable. If you write `write(b)` you should get an error and it will not work correctly. BTW, other prefixes: `u"text"` is unicode data, and `r"C:\path"` is a raw string, so you can safely use \ instead of \\ in a path. – furas Dec 07 '17 at 01:54
  • Thanks for the explanation. I assumed the '{\'product\': [' portion was specific to the question being answered and wasn't sure how to translate it to my project. I had deleted it and never added it back before asking my question. I can't seem to locate good documentation that explains what to put there. Beyond that, I never received an error, which makes me think I'm not calling the pipeline correctly, because it never tried to write with the incorrect variable. – Artie Dec 07 '17 at 02:10
  • Edited: I've made a couple of additional tweaks. I changed the settings to say XYZ.pipelines.XYZExport, and in my crawler process script I imported and added the utility get_project_settings(). However, I'm now receiving the following error in my shell: AttributeError: 'XYZExport' object has no attribute 'start_exporting' – Artie Dec 07 '17 at 02:46
  • You have one big mistake: you can't have two objects with the same name, `XYZExport` imported from `exporters` and your own class `XYZExport`. Your class definition replaces the imported exporter, so inside the class, `self.exporter = XYZExport(file)` creates another instance of your pipeline class instead of the exporter. Better to use `import exporters` and `self.exporter = exporters.XYZExport(file)`. – furas Dec 07 '17 at 13:41
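The name clash described in the last comment can be reproduced without Scrapy. The sketch below is a self-contained illustration under stated assumptions: the `exporters` module is built inline as a stand-in, `_Exporter` is a hypothetical minimal exporter, and `XYZPipeline` is a hypothetical new name for the pipeline class.

```python
import types

# Hypothetical stand-in for the exporters module, built inline so the
# example is self-contained.
exporters = types.ModuleType("exporters")


class _Exporter:
    def __init__(self, file):
        self.file = file

    def start_exporting(self):
        return "started"


exporters.XYZExport = _Exporter

# Same effect as `from exporters import XYZExport`.
XYZExport = exporters.XYZExport


class XYZExport:  # rebinding the name hides the imported exporter
    def __init__(self, file_name):
        self.file_name = file_name

    def open_spider(self):
        # This now builds another pipeline object, not an exporter.
        return XYZExport("out.json")


class XYZPipeline:  # a distinct name keeps the exporter reachable
    def open_spider(self):
        return exporters.XYZExport("out.json")


shadowed = XYZExport("out.json").open_spider()
fixed = XYZPipeline().open_spider()
print(hasattr(shadowed, "start_exporting"))
print(hasattr(fixed, "start_exporting"))
```

If the pipeline class is renamed this way, the ITEM_PIPELINES entry in the settings would need to point at the new name as well.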

0 Answers