How to create a Great Expectations checkpoint for a Pandas dataframe?

Question

My datasource config looks like:

datasource_config = {
    "name": "example_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "module_name": "great_expectations.datasource.data_connector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
context.add_datasource(**datasource_config)

My Pandas dataframe and batch request were created successfully with the following commands:

...
df = read_csv_pandas(
    file_path="../done/my_file.txt",
    sep="|",
    header=0,
    quoting=csv.QUOTE_ALL,
)

batch_request = RuntimeBatchRequest(
    datasource_name="example_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="MyDataAsset",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": "default_identifier"},
)

My expectation suite:

expectation_suite_name = "My_validations"
suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)

Then I'm creating the validator.

validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name=expectation_suite_name
)
validator.head(2)

The last command successfully prints 2 rows of my dataframe.

Then I'm adding expectations to my suite.

validator.expect_table_columns_to_match_ordered_list(['last_name', 'first_name', 'sex'])
validator.expect_column_values_to_be_in_set("sex", ["male", "female", "other", "unknown"])
validator.save_expectation_suite(discard_failed_expectations=False)

Then I'm generating data docs:

suite_identifier = ExpectationSuiteIdentifier(expectation_suite_name=expectation_suite_name)
context.build_data_docs(resource_identifiers=[suite_identifier])
context.open_data_docs(resource_identifier=suite_identifier)

My checkpoint looks like:

name: my_checkpoint_2
config_version: 1
class_name: SimpleCheckpoint
validations:
    - batch_request:
        datasource_name: example_datasource
        data_connector_name: default_runtime_data_connector_name
        data_asset_name: MyDataAsset
        runtime_parameters:
          batch_data: {df}
        batch_identifiers:
          default_identifier_name: default_identifier
expectation_suite_name: My_validations

But this command

context.run_checkpoint(checkpoint_name="my_checkpoint_2")

produces the error:

ValueError: RuntimeDataBatchSpec must provide a Pandas DataFrame or PandasBatchData object.

score 0 · Answer 1 · answered Nov 22 '22 at 16:56

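Great Expectations supports multiple execution engines, and your datasource specifies the PandasExecutionEngine, which only accepts a Pandas DataFrame (or PandasBatchData object) as batch_data. If your data is a Spark dataframe, either change the execution engine to SparkDFExecutionEngine or convert the dataframe to Pandas first.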
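Whichever engine is configured, the checkpoint has to receive an actual dataframe object. A placeholder such as `batch_data: {df}` in the checkpoint YAML is only text and never becomes the `df` variable. One way to handle this, sketched below, is to leave `runtime_parameters` out of the stored checkpoint and pass the dataframe when the checkpoint runs. This is a sketch under assumptions, not a confirmed fix: it assumes the `context` and `df` from the question, a checkpoint whose YAML no longer contains `runtime_parameters`, and a Great Expectations version whose `run_checkpoint` accepts a `batch_request` keyword. The two helper names are hypothetical.

```python
from typing import Any, Dict


def build_runtime_batch_request(
    df: Any, identifier: str = "default_identifier"
) -> Dict[str, Any]:
    """Return the parts of the batch request that cannot be stored in YAML."""
    return {
        # The in-memory dataframe object itself, not a textual placeholder.
        "runtime_parameters": {"batch_data": df},
        # The key must match the datasource's batch_identifiers list.
        "batch_identifiers": {"default_identifier_name": identifier},
    }


def run_checkpoint_with_dataframe(context: Any, checkpoint_name: str, df: Any) -> Any:
    """Run a stored checkpoint and hand it the dataframe at call time.

    Assumption: `context` is the data context from the question and its
    run_checkpoint method accepts a `batch_request` keyword argument.
    """
    return context.run_checkpoint(
        checkpoint_name=checkpoint_name,
        batch_request=build_runtime_batch_request(df),
    )
```

With the objects from the question, the call would be `run_checkpoint_with_dataframe(context, "my_checkpoint_2", df)`.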

Evie Cameron
