4

I have two (web) event tables in BigQuery which are partitioned by a DATE column named _date.

One of the tables does not require the partition filter (tableA), the other does (tableB).

When I configure my Great Expectations datasource config I do specify that I want to have batches split by this column. However the one that requires the filter gives an error that the filter was not provided. I'm having the hardest time seeing in the docs how I could force this with a ConfiguredAssetSqlDataConnector, being the most specific connector I can use for BQ.

Here's what the connector looks like for both tableA and tableB, just swap out the table names:

table_name = "the_dataset.tableA"
bq_table_datasource_config = {
            "name": f"{table_name}_bigquery_datasource",
            "class_name": "Datasource",
            "module_name": "great_expectations.datasource",
            "execution_engine": {
                "module_name": "great_expectations.execution_engine",
                "class_name": "SqlAlchemyExecutionEngine",
                "connection_string": "bigquery://our-bigquery-project",
            },
            "data_connectors": {
                "daily": {
                    "class_name": "ConfiguredAssetSqlDataConnector",
                    "include_schema_name": True,
                    "assets": {
                        f"our-bigquery-project.{table_name}": {
                            "splitter_method": "split_on_year_and_month_and_day",
                            "splitter_kwargs": {
                                "column_name": "_date"
                            },
                        }
                    },
                },
            },
        }

So after building a Great Expectations context named context I can put that into:

context.test_yaml_config(
  yaml.dump(bq_table_datasource_config),
  return_mode="report_object",
  shorten_tracebacks=True,
)

This results in success with tableA and an error with tableB.

Attempting to instantiate class from config...
    Instantiating as a Datasource, since class_name is Datasource
    Successfully instantiated Datasource


ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
    default_configured_data_connector_name : ConfiguredAssetSqlDataConnector

    Available data_asset_names (1 of 1):
        our-bigquery-projet.the_dataset.tableA (3 of 101): [{'_date': {'year': 2022, 'month': 11, 'day': 21}}, {'_date': {'year': 2022, 'month': 11, 'day': 22}}, {'_date': {'year': 2022, 'month': 11, 'day': 23}}]

    Unmatched data_references (0 of 0):[]

VS

Attempting to instantiate class from config...
    Instantiating as a Datasource, since class_name is Datasource
    Successfully instantiated Datasource


ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 203, in _execute
    self._query_job.result()
google.api_core.exceptions.BadRequest: 400 Cannot query over table 'our-bigquery-project.the_dataset.tableB' without a filter over column(s) '_date' that can be used for partition elimination

Location: US
Job ID: 00516791-f350-4f82-a8a1-d2e60bc08ec5


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
google.cloud.bigquery.dbapi.exceptions.DatabaseError: 400 Cannot query over table 'our-bigquery-project.the_dataset.tableB' without a filter over column(s) '_date' that can be used for partition elimination

Location: US
Job ID: 00000000-ffff-4444-aaaa-121212121212


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/config_validator/yaml_config_validator.py", line 266, in test_yaml_config
    report_object: dict = instantiated_class.self_check(
sqlalchemy.exc.DatabaseError: (google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 Cannot query over table 'our-bigquery-project.the_dataset.tableB' without a filter over column(s) '_date' that can be used for partition elimination

Location: US
Job ID: 00000000-ffff-4444-aaaa-121212121212

[SQL: SELECT distinct(concat(concat(concat(%(concat_1:STRING)s, CAST(EXTRACT(year FROM `_date`) AS STRING)), CAST(EXTRACT(month FROM `_date`) AS STRING)), CAST(EXTRACT(day FROM `_date`) AS STRING))) AS `concat_distinct_values`, CAST(EXTRACT(year FROM `_date`) AS INT64) AS `year`, CAST(EXTRACT(month FROM `_date`) AS INT64) AS `month`, CAST(EXTRACT(day FROM `_date`) AS INT64) AS `day` 
FROM our-bigquery-project.the_dataset.tableB]
[parameters: {'concat_1': ''}]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

I'm only validating the data, I don't get to change the schema nor required partition filter. It's normally a good thing that the filter is required.

dlamblin
  • 43,965
  • 20
  • 101
  • 140
  • There's a now deleted answer below that claimed that the `splitter_kwargs` would take a `filter` key with a clause value. It did not, giving: `TypeError: get_data_for_batch_identifiers_year_and_month_and_day() got an unexpected keyword argument 'filter'` And the docs don't mention other splitter_kwargs. https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/datasource_configuration/how_to_configure_a_sql_datasource/#splitting-methods – dlamblin Mar 14 '23 at 00:11

1 Answers1

0

Current status as per discussion with users and developers on the Great Expectations Slack is that this continues to be a known issue. There was a closed issue ticket for it, being unresolved, it may be reopened soon.

dlamblin
  • 43,965
  • 20
  • 101
  • 140