We want to integrate data quality checks into our ETL pipelines and tried this with Great Expectations. All our ETL is in PySpark. For small datasets this works fine, but for larger ones the performance of Great Expectations is really bad: on a 350 GB dataset (stored in Delta), it took about 1.5 hours just to check that 6 columns contain no null values. Are we doing something wrong? We did it in two ways; the difference between them is not entirely clear to us, but both take about the same time. The processing is done on Databricks. For comparison, a plain PySpark version of the same check is sketched after the second example.
First:
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
dq_cols = ["col_1", "col_2", "col_3", "col_4", "col_5", "col_6"]
df = spark.read.load(...)
ge_df = SparkDFDataset(df)
# Register one not-null expectation per column; validate() then runs the whole suite.
for col in dq_cols:
    ge_df.expect_column_values_to_not_be_null(col)
ge_df.validate()
Second:
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint

df = spark.read.load(...)
context = gx.get_context()
# Register the Spark DataFrame as a fluent datasource/asset and build a batch request.
dataframe_datasource = context.sources.add_or_update_spark(name="my_spark_name")
dataframe_asset = dataframe_datasource.add_dataframe_asset(name="my_dataframe_asset_name", dataframe=df)
batch_request = dataframe_asset.build_batch_request()
expectations_object = load_yaml_expectations("...")  # our own helper; reads the expectations from a YAML file
suite = context.add_or_update_expectation_suite(
    expectation_suite_name="my_expectation_suite_name",
    expectations=expectations_object,
)
checkpoint = Checkpoint(
    name="my_checkpoint_name",
    # run_name_template="my_template_name",
    data_context=context,
    batch_request=batch_request,
    expectation_suite_name="my_expectation_suite_name",
    action_list=[
        {
            "name": "validate_expectations",
            "action": {"class_name": "StoreValidationResultAction"},
        }
    ],
)
result = checkpoint.run()
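For comparison, this is roughly the plain PySpark version of the same null check, done in a single pass over the data. It is only a sketch and assumes the same df and dq_cols as in the examples above:
from pyspark.sql import functions as F

# Count nulls for all six columns in one pass over the DataFrame.
null_counts = (
    df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in dq_cols])
    .collect()[0]
    .asDict()
)
failed_columns = {c: n for c, n in null_counts.items() if n > 0}
print(failed_columns or "no nulls found")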