I'm building some web scrapers with Scrapy, using Pydantic models as the items. We currently use the JSON lines item exporter to write the data to a file. Here is an example of a JSON line produced by the scraper, followed by a sketch of how the feed is configured:
{
"timestamp": null,
"deposit_date": "2022-01-14",
"secondary_date": null,
"termination_date": "2024-01-12",
"tax_structure": "UNKNOWN",
"initial_pop": "10.00",
"initial_liq": null,
"term": "Y02",
"narrative_objective": "The trust seeks to provide ....",
"narrative_inv_strategy": "",
"narrative_selection": "",
"narrative_risks": ""
}
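The feed export is configured along these lines in settings.py (the output path here is just an example):

# settings.py -- current feed configuration
FEEDS = {
    "output/items.jl": {
        "format": "jsonlines",
        "encoding": "utf8",
    },
}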
The fields set to null or an empty string are default values supplied by the Pydantic model when the scraper doesn't find the corresponding value on the page. The issue is that these defaults overwrite values coming from other sources (manual input, for instance). I would like the output to omit these empty fields so they can be populated manually later.
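For reference, the item model is shaped roughly like this (trimmed to a few fields, and the class name is just a placeholder):

from typing import Optional

from pydantic import BaseModel


class TrustItem(BaseModel):
    # Placeholder name; the real model has more fields, but every field
    # defaults to None or "" when the scraper can't find it on the page.
    deposit_date: Optional[str] = None
    secondary_date: Optional[str] = None
    termination_date: Optional[str] = None
    initial_pop: Optional[str] = None
    narrative_objective: str = ""
    narrative_risks: str = ""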
Desired output:
{
"deposit_date": "2022-01-14",
"termination_date": "2024-01-12",
"tax_structure": "UNKNOWN",
"initial_pop": "10.00",
"term": "Y02",
"narrative_objective": "The trust seeks to provide ....",
}
One possible solution is to change the model to include only the fields that are actually scraped, but I'd like to avoid that: I'm building similar scrapers for four different sites and don't want to maintain 4+ different models, and even pages on the same site include or omit fields depending on the product. The solution I'd like to implement is to customize the feed exporter so it skips these "empty" fields, leaving them to be populated manually later. I've read through Scrapy's docs on feed exports but would like a bit more detail on how to go about this.
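From skimming scrapy/exporters.py, it looks like JsonLinesItemExporter.export_item() builds a dict from the item and writes the encoded line, so the rough sketch below is what I have in mind (the module path and class name are mine, and I haven't verified that overriding export_item is the intended extension point):

# exporters.py in my project -- untested sketch
from scrapy.exporters import JsonLinesItemExporter
from scrapy.utils.python import to_bytes


class NonEmptyJsonLinesItemExporter(JsonLinesItemExporter):
    """JSON lines exporter that drops None and empty-string values."""

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        # Drop fields the scraper left at their defaults so they don't
        # clobber manually entered values downstream.
        itemdict = {k: v for k, v in itemdict.items() if v is not None and v != ""}
        data = self.encoder.encode(itemdict) + "\n"
        self.file.write(to_bytes(data, self.encoding))

and then point the jsonlines format at it in settings.py:

# settings.py -- "myproject.exporters" is wherever the class above lives
FEED_EXPORTERS = {
    "jsonlines": "myproject.exporters.NonEmptyJsonLinesItemExporter",
}

I'm also not sure whether filtering in the exporter like this is better than doing it in an item pipeline, so pointers either way are welcome.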
Any help would be appreciated.