I'm using the Great Expectations Python package (version 0.14.10) to validate some data. I've already followed the provided tutorials and created a great_expectations.yml in the local ./great_expectations folder. I've also created a Great Expectations suite based on a .csv version of the data (call this file ge_suite.json).

GOAL: I want to use the ge_suite.json file to validate an in-memory pandas DataFrame.

I've tried following the answer to this SO question, with code that looks like this:

import great_expectations as ge
import pandas as pd
from ruamel import yaml
from great_expectations.data_context import DataContext

context = DataContext()
df = pd.read_pickle('/path/to/my/df.pkl')
batch_kwargs = {"datasource": "my_datasource_name", "dataset": df}
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="ge_suite")

My datasources section of my great_expectations.yml file looks like this:

datasources:
  my_datasource_name:
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_inferred_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        base_directory: /tmp
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector

When I run the batch = context.get_batch(... command in Python, I get the following error:

File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1655, in get_batch
  return self._get_batch_v2(
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1351, in _get_batch_v2
  batch = datasource.get_batch(
AttributeError: 'Datasource' object has no attribute 'get_batch'

I'm assuming that I need to add something to the definition of the datasource in the great_expectations.yml file to fix this. Or, could it be a versioning issue? I'm not sure. I looked around for a while in the online documentation and didn't find an answer. How do I achieve the "GOAL" (defined above) and get past this error?

Jed

1 Answer

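The traceback shows the call going through _get_batch_v2, so the error appears to come from mixing API versions: passing batch_kwargs to context.get_batch uses the older V2 (Batch Kwargs) API, while the datasource in the great_expectations.yml above is configured in the newer V3 (Batch Request) style, whose Datasource class has no get_batch method. To validate an in-memory pandas DataFrame with that configuration, see the following two pages: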

https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/

https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_create_a_batch_of_data_from_an_in_memory_spark_or_pandas_dataframe/

For a concrete example, you can do something like this:

import great_expectations as ge
import os
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()
df = pd.read_pickle('/path/to/my/df.pkl')

suite_name = 'ge_suite'
data_asset_name = 'your_data_asset_name'
batch_id = 'your_batch_id'

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name=data_asset_name,
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": batch_id},
)

# context.run_checkpoint looks up the checkpoint by name and expects a
# checkpoint file on disk, so create one first.
checkpoint_name = 'your_checkpoint_name'
checkpoint_path = os.path.abspath(f'./great_expectations/checkpoints/{checkpoint_name}.yml')
checkpoint_yml = f'''
name: {checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
expectation_suite_name: {suite_name}
'''
with open(checkpoint_path, 'w') as f:
    f.write(checkpoint_yml)

result = context.run_checkpoint(
    checkpoint_name=checkpoint_name,
    validations=[{"batch_request": batch_request, "expectation_suite_name": suite_name}],
)
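
The checkpoint-writing step above fails if the checkpoints folder does not exist yet. As a minimal sketch (the helper name write_checkpoint_file and its ge_root parameter are hypothetical, and it assumes the project keeps checkpoints in the checkpoints/ folder under the Great Expectations root, as the code above does), the same file can be written with pathlib, creating the folder when needed:

```python
from pathlib import Path


def write_checkpoint_file(checkpoint_name, suite_name, ge_root="./great_expectations"):
    """Write a minimal SimpleCheckpoint config file and return its path.

    Assumption: ge_root is the folder that holds great_expectations.yml and
    checkpoints live in its "checkpoints" subfolder.
    """
    checkpoint_dir = Path(ge_root) / "checkpoints"
    # Create the checkpoints folder if the project does not have one yet.
    checkpoint_dir.mkdir(parents=True, exist_ok=True)

    checkpoint_path = checkpoint_dir / f"{checkpoint_name}.yml"
    # Same four keys as the inline YAML in the example above.
    checkpoint_yml = (
        f"name: {checkpoint_name}\n"
        "config_version: 1\n"
        "class_name: SimpleCheckpoint\n"
        f"expectation_suite_name: {suite_name}\n"
    )
    checkpoint_path.write_text(checkpoint_yml)
    return checkpoint_path
```

Calling write_checkpoint_file(checkpoint_name, suite_name) before context.run_checkpoint(...) would replace the open/write lines in the example. After the run, the returned result should expose an overall success flag (for instance result["success"]) that can be checked to see whether every expectation in the suite passed.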
Jed