In great_expectations, I am trying to add a checkpoint to a context. The batch of data refers to a CSV file stored on S3 that uses a semicolon as separator, and I am loading it with a PySpark execution engine. Here is what I tried.
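For context, the datasource is configured roughly along these lines (a sketch: the bucket, prefix, and regex are placeholders, but the Spark execution engine and inferred S3 connector reflect my setup):

datasource_config = {
    "name": "my_datasource",
    "class_name": "Datasource",
    "execution_engine": {"class_name": "SparkDFExecutionEngine"},
    "data_connectors": {
        "default_inferred_data_connector_name": {
            "class_name": "InferredAssetS3DataConnector",
            "bucket": "my-bucket",      # placeholder
            "prefix": "my/prefix/",     # placeholder
            "default_regex": {
                "pattern": r"(.*)\.csv",
                "group_names": ["data_asset_name"],
            },
        }
    },
}
context.add_datasource(**datasource_config)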
First I defined a batch request to retrieve the data. The documentation recommends using batch_spec_passthrough to specify reader options such as the separator:
from great_expectations.core.batch import BatchRequest

batch_request = {
    "datasource_name": "my_datasource",
    "data_connector_name": "default_inferred_data_connector_name",
    "data_asset_name": "my_data_asset",
    "batch_spec_passthrough": {"reader_options": {"sep": ";", "header": "true", "inferSchema": "true"}},
}
br = BatchRequest(**batch_request)
Then I specified which expectation suite I want to use:
expectation_suite_name = "my_expectation_suite"
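As a sanity check, fetching a validator directly with this batch request (a sketch using the objects defined above) is a quick way to confirm the reader options are honoured outside the checkpoint:

validator = context.get_validator(
    batch_request=br,
    expectation_suite_name=expectation_suite_name,
)
print(validator.head())  # if the reader options are applied, columns should be split on ';' here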
At this point I defined the checkpoint:
checkpoint_name = "my_checkpoint"
checkpoint_config = {
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": br,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
}
from great_expectations.checkpoint import SimpleCheckpoint

checkpoint = SimpleCheckpoint(
    name=checkpoint_name,
    data_context=context,
    **checkpoint_config,
)
And finally I added the checkpoint to the data context and ran it:
checkpoint_json = checkpoint.get_config().to_json_dict()
context.add_checkpoint(**checkpoint_json)
context.run_checkpoint(checkpoint_name=checkpoint_name)
The problem is that when I ran the checkpoint, I obtained an error indicating the batch data was not loaded correctly. None of the expectations work, because it seems the batch data was not read using the semicolon as separator and therefore ended up as a single-column dataframe. This is one of the errors I see when running the checkpoint:
"exception_message": "Error: The column "GR" in BatchData does not exist."
However, when I do not add the checkpoint to the context and instead just run the in-memory checkpoint, everything is fine. That is, the following command works:
checkpoint_result = checkpoint.run()
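and a minimal check on the result object confirms the run succeeds in that case:

print(checkpoint_result.success)  # True when the checkpoint is run in-memory like this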
Any help?