I'm building some web scrapers with Scrapy, using Pydantic models as the items. We currently use the JSON lines item exporter to write the data to a file. Here is an example of a JSON line produced by the scraper, followed by a sketch of how the feed is configured:
{
"timestamp": null,
"deposit_date": "2022-01-14",
"secondary_date": null,
"termination_date": "2024-01-12",
"tax_structure": "UNKNOWN",
"initial_pop": "10.00",
"initial_liq": null,
"term": "Y02",
"narrative_objective": "The trust seeks to provide ....",
"narrative_inv_strategy": "",
"narrative_selection": "",
"narrative_risks": ""
}
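The feed export is configured along these lines in settings.py (the output path here is just an example):

# settings.py -- current feed configuration
FEEDS = {
    "output/items.jl": {
        "format": "jsonlines",
        "encoding": "utf8",
    },
}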
The fields set to null or an empty string are default values supplied by the Pydantic model when the scraper doesn't find the corresponding value on the page. The issue is that these defaults overwrite values coming from other sources (manual input, for instance). I would like the output to omit these empty fields so they can be populated manually later.
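For reference, the item model is shaped roughly like this (trimmed to a few fields, and the class name is just a placeholder):

from typing import Optional

from pydantic import BaseModel


class TrustItem(BaseModel):
    # Placeholder name; the real model has more fields, but every field
    # defaults to None or "" when the scraper can't find it on the page.
    deposit_date: Optional[str] = None
    secondary_date: Optional[str] = None
    termination_date: Optional[str] = None
    initial_pop: Optional[str] = None
    narrative_objective: str = ""
    narrative_risks: str = ""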
Desired output:
{
"deposit_date": "2022-01-14",
"termination_date": "2024-01-12",
"tax_structure": "UNKNOWN",
"initial_pop": "10.00",
"term": "Y02",
"narrative_objective": "The trust seeks to provide ....",
}
One possible solution is to change the model to include only the fields that are actually scraped, but I'd like to avoid that: I'm building similar scrapers for four different sites and don't want to maintain 4+ different models, and even pages on the same site include or omit fields depending on the product. The solution I'd like to implement is to customize the feed exporter so it skips these "empty" fields, leaving them to be populated manually later. I've read through Scrapy's docs on feed exports but would like a bit more detail on how to go about this.
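From skimming scrapy/exporters.py, it looks like JsonLinesItemExporter.export_item() builds a dict from the item and writes the encoded line, so the rough sketch below is what I have in mind (the module path and class name are mine, and I haven't verified that overriding export_item is the intended extension point):

# exporters.py in my project -- untested sketch
from scrapy.exporters import JsonLinesItemExporter
from scrapy.utils.python import to_bytes


class NonEmptyJsonLinesItemExporter(JsonLinesItemExporter):
    """JSON lines exporter that drops None and empty-string values."""

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        # Drop fields the scraper left at their defaults so they don't
        # clobber manually entered values downstream.
        itemdict = {k: v for k, v in itemdict.items() if v is not None and v != ""}
        data = self.encoder.encode(itemdict) + "\n"
        self.file.write(to_bytes(data, self.encoding))

and then point the jsonlines format at it in settings.py:

# settings.py -- "myproject.exporters" is wherever the class above lives
FEED_EXPORTERS = {
    "jsonlines": "myproject.exporters.NonEmptyJsonLinesItemExporter",
}

I'm also not sure whether filtering in the exporter like this is better than doing it in an item pipeline, so pointers either way are welcome.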
Any help would be appreciated.