I have two (web) event tables in BigQuery which are partitioned by a DATE column named _date
.
One of the tables does not require the partition filter (tableA), the other does (tableB).
When I configure my Great Expectations datasource config I do specify that I want to have batches split by this column. However the one that requires the filter gives an error that the filter was not provided. I'm having the hardest time seeing in the docs how I could force this with a ConfiguredAssetSqlDataConnector
, being the most specific connector I can use for BQ.
Here's what the connector looks like for both tableA and tableB, just swap out the table names:
table_name = "the_dataset.tableA"
bq_table_datasource_config = {
"name": f"{table_name}_bigquery_datasource",
"class_name": "Datasource",
"module_name": "great_expectations.datasource",
"execution_engine": {
"module_name": "great_expectations.execution_engine",
"class_name": "SqlAlchemyExecutionEngine",
"connection_string": "bigquery://our-bigquery-project",
},
"data_connectors": {
"daily": {
"class_name": "ConfiguredAssetSqlDataConnector",
"include_schema_name": True,
"assets": {
f"our-bigquery-project.{table_name}": {
"splitter_method": "split_on_year_and_month_and_day",
"splitter_kwargs": {
"column_name": "_date"
},
}
},
},
},
}
So after building a Great Expectations context named context
I can put that into:
context.test_yaml_config(
yaml.dump(bq_table_datasource_config),
return_mode="report_object",
shorten_tracebacks=True,
)
This results in success with tableA and an error with tableB.
Attempting to instantiate class from config...
Instantiating as a Datasource, since class_name is Datasource
Successfully instantiated Datasource
ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
default_configured_data_connector_name : ConfiguredAssetSqlDataConnector
Available data_asset_names (1 of 1):
our-bigquery-projet.the_dataset.tableA (3 of 101): [{'_date': {'year': 2022, 'month': 11, 'day': 21}}, {'_date': {'year': 2022, 'month': 11, 'day': 22}}, {'_date': {'year': 2022, 'month': 11, 'day': 23}}]
Unmatched data_references (0 of 0):[]
VS
Attempting to instantiate class from config...
Instantiating as a Datasource, since class_name is Datasource
Successfully instantiated Datasource
ExecutionEngine class name: SqlAlchemyExecutionEngine
Data Connectors:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 203, in _execute
self._query_job.result()
google.api_core.exceptions.BadRequest: 400 Cannot query over table 'our-bigquery-project.the_dataset.tableB' without a filter over column(s) '_date' that can be used for partition elimination
Location: US
Job ID: 00516791-f350-4f82-a8a1-d2e60bc08ec5
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
self.dialect.do_execute(
google.cloud.bigquery.dbapi.exceptions.DatabaseError: 400 Cannot query over table 'our-bigquery-project.the_dataset.tableB' without a filter over column(s) '_date' that can be used for partition elimination
Location: US
Job ID: 00000000-ffff-4444-aaaa-121212121212
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/great_expectations/data_context/config_validator/yaml_config_validator.py", line 266, in test_yaml_config
report_object: dict = instantiated_class.self_check(
sqlalchemy.exc.DatabaseError: (google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 Cannot query over table 'our-bigquery-project.the_dataset.tableB' without a filter over column(s) '_date' that can be used for partition elimination
Location: US
Job ID: 00000000-ffff-4444-aaaa-121212121212
[SQL: SELECT distinct(concat(concat(concat(%(concat_1:STRING)s, CAST(EXTRACT(year FROM `_date`) AS STRING)), CAST(EXTRACT(month FROM `_date`) AS STRING)), CAST(EXTRACT(day FROM `_date`) AS STRING))) AS `concat_distinct_values`, CAST(EXTRACT(year FROM `_date`) AS INT64) AS `year`, CAST(EXTRACT(month FROM `_date`) AS INT64) AS `month`, CAST(EXTRACT(day FROM `_date`) AS INT64) AS `day`
FROM our-bigquery-project.the_dataset.tableB]
[parameters: {'concat_1': ''}]
(Background on this error at: https://sqlalche.me/e/14/4xp6)
I'm only validating the data, I don't get to change the schema nor required partition filter. It's normally a good thing that the filter is required.