We want to integrate data quality checks into our ETL pipelines and tried this with Great Expectations. All our ETL is in PySpark. For small datasets this works fine, but for larger ones the performance of Great Expectations is really bad: on a 350 GB dataset (stored in Delta), it took about 1.5 hours just to check that 6 columns contain no null values. Are we doing something wrong? We did it in two ways; the difference between them is not entirely clear to us, but both take about the same time. The processing is done on Databricks. For comparison, a plain PySpark version of the same check is sketched after the second example.
First:
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
dq_cols = ["col_1", "col_2", "col_3", "col_4", "col_5", "col_6"]
df = spark.read.load(...)
ge_df = SparkDFDataset(df)
# Register one not-null expectation per column; validate() then runs the whole suite.
for col in dq_cols:
    ge_df.expect_column_values_to_not_be_null(col)
ge_df.validate()
Second:
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint

df = spark.read.load(...)
context = gx.get_context()
# Register the Spark DataFrame as a fluent datasource/asset and build a batch request.
dataframe_datasource = context.sources.add_or_update_spark(name="my_spark_name")
dataframe_asset = dataframe_datasource.add_dataframe_asset(name="my_dataframe_asset_name", dataframe=df)
batch_request = dataframe_asset.build_batch_request()
expectations_object = load_yaml_expectations("...")  # our own helper; reads the expectations from a YAML file
suite = context.add_or_update_expectation_suite(
    expectation_suite_name="my_expectation_suite_name",
    expectations=expectations_object,
)
checkpoint = Checkpoint(
    name="my_checkpoint_name",
    # run_name_template="my_template_name",
    data_context=context,
    batch_request=batch_request,
    expectation_suite_name="my_expectation_suite_name",
    action_list=[
        {
            "name": "validate_expectations",
            "action": {"class_name": "StoreValidationResultAction"},
        }
    ],
)
result = checkpoint.run()
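For comparison, this is roughly the plain PySpark version of the same null check, done in a single pass over the data. It is only a sketch and assumes the same df and dq_cols as in the examples above:
from pyspark.sql import functions as F

# Count nulls for all six columns in one pass over the DataFrame.
null_counts = (
    df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in dq_cols])
    .collect()[0]
    .asDict()
)
failed_columns = {c: n for c, n in null_counts.items() if n > 0}
print(failed_columns or "no nulls found")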