
I am using Kedro to build an ETL pipeline, with column-specific validations handled by Great Expectations. I am using the hooks.py file listed in the Kedro documentation here, and the hook is registered as per the instructions in the Kedro docs.

This is my current workflow:

  1. Created a Kedro project using kedro new; the project name is ecom_analytics
  2. Stored the datasets dataset_raw.csv and dataset_validate.csv in the data/01_raw folder
  3. Initialized a Great Expectations project using great_expectations init
  4. Created a new datasource using great_expectations datasource new, named main_datasource
  5. Created a new expectation suite using great_expectations suite new; the suite is called data.raw and was built with the data assistant
  6. Edited the suite using great_expectations suite edit data.raw
  7. Created the catalog entries for the datasets in data/01_raw
  8. Added the Great Expectations hooks.py given in the Kedro documentation and registered the hook in the settings.py file
  9. Tried kedro viz --autoreload; this works and shows the visualisation
  10. Running kedro run gives the error below
│ /opt/conda/lib/python3.9/site-packages/great_expectations/data_context/data_context/abstract_dat │
│ a_context.py:758 in get_batch                                                                    │
│                                                                                                  │
│    755 │   │   else:                                                                             │
│    756 │   │   │   data_asset_type = arg3                                                        │
│    757 │   │   batch_parameters = kwargs.get("batch_parameters")                                 │
│ ❱  758 │   │   return self._get_batch_v2(                                                        │
│    759 │   │   │   batch_kwargs=batch_kwargs,                                                    │
│    760 │   │   │   expectation_suite_name=expectation_suite_name,                                │
│    761 │   │   │   data_asset_type=data_asset_type,                                              │
│                                                                                                  │
│ /opt/conda/lib/python3.9/site-packages/great_expectations/data_context/data_context/abstract_dat │
│ a_context.py:867 in _get_batch_v2                                                                │
│                                                                                                  │
│    864 │   │   │   expectation_suite = self.get_expectation_suite(expectation_suite_name)        │
│    865 │   │                                                                                     │
│    866 │   │   datasource = self.get_datasource(batch_kwargs.get("datasource"))  # type: ignore  │
│ ❱  867 │   │   batch = datasource.get_batch(  # type: ignore[union-attr]                         │
│    868 │   │   │   batch_kwargs=batch_kwargs, batch_parameters=batch_parameters                  │
│    869 │   │   )                                                                                 │
│    870 │   │   if data_asset_type is None:                                                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Datasource' object has no attribute 'get_batch'

Please use the latest develop branch of the following project to look into the issue: https://github.com/DhavalThkkar/ecom-analytics

This is extremely difficult to work with. I have loaded the dataset I want to validate into the data/01_raw folder. If someone can help me with an end-to-end example for this repo, it would be really appreciated.
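For reference, the catalog entries from step 7 look roughly like this (a minimal sketch; the dataset type and filepaths are assumptions based on the folder layout described above):

```yaml
# conf/base/catalog.yml
dataset_raw:
  type: pandas.CSVDataSet
  filepath: data/01_raw/dataset_raw.csv

dataset_validate:
  type: pandas.CSVDataSet
  filepath: data/01_raw/dataset_validate.csv
```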

2 Answers


You can check a minimal example here: https://github.com/erwinpaillacan/kedro-great-expectations-example

Basically, you need to define:

  • A memory dataset, which is already defined in the example
  • Your expectations
  • A checkpoint linked to your expectation suite
  • A mapper from catalog datasets to checkpoints: conf/base/parameters/great_expectations_hook.yml
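A sketch of what that mapper file might contain, assuming the hook reads a dataset-to-checkpoint mapping (the key name and checkpoint names here are hypothetical; check the linked repo for the exact format it expects):

```yaml
# conf/base/parameters/great_expectations_hook.yml
# Maps Kedro catalog dataset names to Great Expectations checkpoint names.
dataset_checkpoint_mapping:
  dataset_raw: raw_checkpoint
  dataset_validate: validate_checkpoint
```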

You need to create the datasource. For more information (and example code to resolve a very similar issue), see https://github.com/great-expectations/great_expectations/issues/1389#issuecomment-624955813.
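The traceback suggests the hook is calling the legacy (v2) `get_batch` API, while `great_expectations datasource new` in recent versions creates a v3-style datasource that has no `get_batch` method. One way to reconcile this is to define a v2-style datasource block in great_expectations.yml, as in the linked issue. A minimal sketch, assuming the legacy pandas datasource classes and that your raw data lives in data/01_raw (the datasource name matches the one above; the base_directory path is relative to the great_expectations folder and is an assumption):

```yaml
# great_expectations.yml (legacy v2-style datasource config)
datasources:
  main_datasource:
    class_name: PandasDatasource
    module_name: great_expectations.datasource
    data_asset_type:
      class_name: PandasDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: ../data/01_raw
```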

deepyaman
  • This doesn't address the issue. I have updated the description with the exact workflow I used, which still lands me on the same issue. – Dhaval Thakkar Dec 21 '22 at 10:06