I am using Kedro to build an ETL pipeline, and column-specific validations are done with Great Expectations. The Kedro documentation lists a `hooks.py` file for this, and I have registered that hook following the instructions in the Kedro docs.
This is my current workflow:
- Created a Kedro project using `kedro new`, with the project name `ecom_analytics`
- Stored the datasets, `dataset_raw.csv` and `dataset_validate.csv`, in the `data/01_raw` folder
- Initialized a Great Expectations project using `great_expectations init`
- Created a new datasource using `great_expectations datasource new`. The name I gave it was `main_datasource`
- Created a new expectation suite using `great_expectations suite new`. This suite is called `data.raw` and was created using the data assistant
- Edited the Great Expectations suite using `great_expectations suite edit data.raw`
- Created the catalog entries for the datasets in `data/01_raw`
- Added the Great Expectations `hooks.py` given in the Kedro documentation and registered the hook in the `settings.py` file
- Tried `kedro viz --autoreload`. This works and shows the visualisation
- When using `kedro run`, it gives the error:
│ /opt/conda/lib/python3.9/site-packages/great_expectations/data_context/data_context/abstract_dat │
│ a_context.py:758 in get_batch │
│ │
│ 755 │ │ else: │
│ 756 │ │ │ data_asset_type = arg3 │
│ 757 │ │ batch_parameters = kwargs.get("batch_parameters") │
│ ❱ 758 │ │ return self._get_batch_v2( │
│ 759 │ │ │ batch_kwargs=batch_kwargs, │
│ 760 │ │ │ expectation_suite_name=expectation_suite_name, │
│ 761 │ │ │ data_asset_type=data_asset_type, │
│ │
│ /opt/conda/lib/python3.9/site-packages/great_expectations/data_context/data_context/abstract_dat │
│ a_context.py:867 in _get_batch_v2 │
│ │
│ 864 │ │ │ expectation_suite = self.get_expectation_suite(expectation_suite_name) │
│ 865 │ │ │
│ 866 │ │ datasource = self.get_datasource(batch_kwargs.get("datasource")) # type: ignore │
│ ❱ 867 │ │ batch = datasource.get_batch( # type: ignore[union-attr] │
│ 868 │ │ │ batch_kwargs=batch_kwargs, batch_parameters=batch_parameters │
│ 869 │ │ ) │
│ 870 │ │ if data_asset_type is None: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'Datasource' object has no attribute 'get_batch'
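One possible reading of the traceback, offered as an assumption and not checked against the repository: the hook goes through the `batch_kwargs`-style `get_batch` path, while the datasource object that comes back does not expose a `get_batch` method. The sketch below illustrates only that mismatch. `LegacyStyleDatasource`, `NewStyleDatasource`, `describe_datasource` and `get_batch_or_explain` are hypothetical stand-ins written for this illustration; they are not Great Expectations or Kedro classes or functions.

```python
"""Illustration only: stand-in classes, not the real Great Expectations API."""


class LegacyStyleDatasource:
    """Hypothetical stand-in for a datasource that exposes get_batch()."""

    def get_batch(self, batch_kwargs, batch_parameters=None):
        # Echo the arguments back so the call path can be inspected.
        return {"batch_kwargs": batch_kwargs, "batch_parameters": batch_parameters}


class NewStyleDatasource:
    """Hypothetical stand-in for a datasource that has no get_batch()."""


def describe_datasource(datasource):
    """Report whether the object can serve a batch_kwargs-style get_batch() call."""
    if callable(getattr(datasource, "get_batch", None)):
        return "legacy-style: get_batch() is available"
    return "no get_batch(): a batch_kwargs-based hook will raise AttributeError"


def get_batch_or_explain(datasource, batch_kwargs):
    """Call get_batch() when it exists; otherwise fail with a clearer message."""
    if not callable(getattr(datasource, "get_batch", None)):
        raise TypeError(
            f"{type(datasource).__name__} has no get_batch(); the hook and the "
            "datasource configuration appear to target different APIs"
        )
    return datasource.get_batch(batch_kwargs=batch_kwargs)


if __name__ == "__main__":
    print(describe_datasource(LegacyStyleDatasource()))
    print(describe_datasource(NewStyleDatasource()))
```

If this reading is right, a check like this placed just before the batch is requested would show whether the datasource named `main_datasource` matches the API style that the hook assumes.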
Please use the latest `develop` branch of the following project to look into the issue: https://github.com/DhavalThkkar/ecom-analytics
This is extremely difficult to work with. I have loaded the dataset that I want to validate into the `data/01_raw` folder. If someone can help me with an end-to-end example for this repo, it would be really appreciated.