
I am using Great Expectations to create data quality tests on intermediate featuresets in a PySpark featureset-generation pipeline. The intermediate featuresets are stored across thousands of .snappy.parquet files to support the distributed compute resources.

I am able to create a datasource using the following configuration:

intermediate_df_datasource_config = {
    "name": "Spark_source",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "SparkDFExecutionEngine"},
    "data_connectors": {
        "dt_partitioned_intermediate_featuresets": {
            "class_name": "InferredAssetGCSDataConnector",
            "bucket_or_name": "[bucket]",
            "prefix": "[prefix]",
            "default_regex": {
                "pattern": "[prefix](.*)_df/dt=(\\d{4})-(\\d{2})-(\\d{2})(.*)\\.snappy\\.parquet",
                "group_names": ["data_asset_name", "year", "month", "day", "partition"],
            },
        },
    },
}
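For reference, this dict is registered directly on the DataContext rather than via YAML (a minimal sketch, assuming a standard file-backed context and GE ~0.15.x):

import great_expectations as ge

# Obtain the project's DataContext and register the config above
context = ge.get_context()
context.add_datasource(**intermediate_df_datasource_config)

# Sanity check: list the assets the connector can see
print(context.get_available_data_asset_names("Spark_source"))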

This creates a datasource that looks like:

Available data_asset_names (1 of 1):
        intermediate_features (3 of 2011): [list of file names in data asset]

So I know it can see all of the 2011 parquet files in this data asset.

I then go on to generate some expectations for this data asset, creating a BatchRequest that looks like this:

from great_expectations.core.batch import BatchRequest

batch_request = BatchRequest(
    datasource_name="Spark_source",
    data_connector_name="dt_partitioned_intermediate_featuresets",
    data_asset_name="intermediate_features",
    batch_spec_passthrough={"reader_method": "parquet"},
)
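The validator is then built from this batch request in the usual way (a sketch; the suite name here is a placeholder):

# Create an (empty) suite and bind a validator to the batch request above
suite_name = "intermediate_features_suite"
context.create_expectation_suite(suite_name, overwrite_existing=True)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=suite_name,
)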

The issue arises when I use this BatchRequest to power a validator. When I call validator.active_batch_definition, it shows me:

{
  "datasource_name": "Spark_source",
  "data_connector_name": "dt_partitioned_intermediate_featuresets",
  "data_asset_name": "intermediate_features",
  "batch_identifiers": {
    "year": "2022",
    "month": "03",
    "day": "06",
    "partition": "[rest of file name]"
  }
}

The batch_identifiers clearly point to a single file, and when I develop against the validator with something like:

validator.expect_table_row_count_to_be_between(min_value=0, max_value=1000000)

It comes back with:

{
  ...
  "result": {
    "observed_value": 24
  },
  "success": true
}

This shows it is really only looking at that single file (which has 24 rows).
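One way to confirm that the connector is producing one batch per file rather than one batch per asset is to pull the full batch list for the same request (a sketch, reusing the context and batch_request from above):

# Each regex match becomes its own batch, so this should print 2011
batch_list = context.get_batch_list(batch_request=batch_request)
print(len(batch_list))
print(batch_list[-1].batch_definition.batch_identifiers)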

I get the same result when I hand a similarly configured BatchRequest to context.add_checkpoint().
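For completeness, the checkpoint is wired up along these lines (a sketch; the checkpoint and suite names are placeholders):

# Register a checkpoint and validate the same batch request at run time
checkpoint = context.add_checkpoint(
    name="intermediate_features_checkpoint",
    config_version=1,
    class_name="SimpleCheckpoint",
)
result = checkpoint.run(
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "intermediate_features_suite",
        }
    ],
)  # again only validates a single file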

I want to be able to check at least 100 files (or more), so running one batch per file is really inefficient. That brings me to my question: how do I get the batch/validator/checkpoint to utilize more than a single file when developing and checking expectations?

• Hello, have you found an answer to your own question? If so, please share it because I'm facing the same issue. – Omar Nov 08 '22 at 12:14
• @Omar no solution yet. Opted to push to BigQuery for a check. Still looking for a better fix. – Stod Nov 09 '22 at 16:25

0 Answers