I am using Great Expectations to create data quality tests on intermediate featuresets in a PySpark featureset-generation pipeline. The intermediate featuresets are stored across thousands of .snappy.parquet files to support distributed compute.
I am able to create a datasource using the following config:
intermediate_df_datasource_config = {
    "name": "Spark_source",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "SparkDFExecutionEngine"},
    "data_connectors": {
        "dt_partitioned_intermediate_featuresets": {
            "class_name": "InferredAssetGCSDataConnector",
            "bucket_or_name": "[bucket]",
            "prefix": "[prefix]",
            "default_regex": {
                "pattern": "[prefix](.*)_df/dt=(\\d{4})-(\\d{2})-(\\d{2})(.*)\\.snappy\\.parquet",
                "group_names": ["data_asset_name", "year", "month", "day", "partition"],
            },
        },
    },
}
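I sanity-check and register it roughly like this (a sketch; the context setup is omitted, and test_yaml_config wants a YAML string, so I dump the dict first):

import yaml

# Self-check the config; this prints the data connector summary shown below
context.test_yaml_config(yaml.dump(intermediate_df_datasource_config))

# Register the datasource with the data context
context.add_datasource(**intermediate_df_datasource_config)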
This creates a datasource that looks like:

Available data_asset_names (1 of 1):
    intermediate_features (3 of 2011): [list of file names in data asset]
So I know it can see all of the 2011 parquet files in this data asset.
I then go on to generate some expectations against this data asset, creating a BatchRequest that looks like this:

from great_expectations.core.batch import BatchRequest

batch_request = BatchRequest(
    datasource_name="Spark_source",
    data_connector_name="dt_partitioned_intermediate_featuresets",
    data_asset_name="intermediate_features",
    batch_spec_passthrough={"reader_method": "parquet"},
)
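For context, I hand the BatchRequest to a validator roughly like this (the suite name is a placeholder):

# Build a validator from the batch request -- the suite name is a placeholder
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="[suite_name]",
)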
The issue arises when I use this BatchRequest to power the validator. Calling validator.active_batch_definition shows me:
{
    "datasource_name": "Spark_source",
    "data_connector_name": "dt_partitioned_intermediate_featuresets",
    "data_asset_name": "intermediate_features",
    "batch_identifiers": {
        "year": "2022",
        "month": "03",
        "day": "06",
        "partition": "[rest of file name]"
    }
}
These batch_identifiers clearly point to a single file. When I develop against the validator with something like:
validator.expect_table_row_count_to_be_between(min_value=0, max_value=1000000)
It comes back with:
{
    ...
    "result": {
        "observed_value": 24
    },
    "success": true
}
This shows it is really only looking at that single file (which has 24 rows).
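As far as I can tell, the data connector splits this asset into one batch per file; listing the batches for the request should confirm that (a sketch, assuming the same context):

# Each file matched by the connector's regex becomes its own batch --
# I'd expect this to report one batch per parquet file (2011 here)
batch_list = context.get_batch_list(batch_request=batch_request)
print(len(batch_list))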
I get the same result when I hand a similarly configured BatchRequest to context.add_checkpoint().
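For reference, that checkpoint looks roughly like this (the checkpoint and suite names are placeholders):

# A rough sketch of the checkpoint config -- names are placeholders
checkpoint = context.add_checkpoint(
    name="[checkpoint_name]",
    class_name="SimpleCheckpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "[suite_name]",
        }
    ],
)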
I want to be able to check at least 100 files (or more) at a time, so validating one batch per file is really inefficient. That brings me to my question: how do I get the batch/validator/checkpoint to use more than a single file when developing and checking expectations?