For pandas (and Spark), there is a good general-purpose way to take full control over how the data is read: pass an already-loaded dataframe via your BatchKwargs.
So, in your case, you could do the following:
import pandas as pd

# Load the pickled dataframe yourself, then hand it to the context
my_dataset = pd.read_pickle(filename)
batch_kwargs = {"dataset": my_dataset}
batch = context.get_batch("my_datasource/in_memory_generator/my_dataset", "warning", batch_kwargs)
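Since the question starts from a pickled dataframe, here is a minimal pandas-only round trip (the file name `my_data.pkl` is illustrative) showing the object you end up passing as the `"dataset"` batch kwarg:

```python
import pandas as pd

# Build a small dataframe, pickle it, and read it back with
# pd.read_pickle -- the same call used in the snippet above.
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
df.to_pickle("my_data.pkl")

my_dataset = pd.read_pickle("my_data.pkl")
batch_kwargs = {"dataset": my_dataset}  # same shape as above
```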
Note: this is for the 0.8.x series API, and assumes a data context configuration like the following:
datasources:
  my_datasource:
    class_name: PandasDatasource
    ...
    generators:
      in_memory_generator:
        class_name: InMemoryGenerator
PS: this use case is the primary reason the InMemoryGenerator exists.
EDIT
In Great Expectations >= 0.9.0, the get_batch API has been simplified: you no longer need a generator at all in this case, and the datasource name is specified directly in the batch kwargs. The analogous snippet looks like this:
import pandas as pd
from great_expectations.data_context import DataContext

context = DataContext()
# Load the dataframe yourself; the datasource name goes in the batch kwargs
my_dataset = pd.read_pickle(filename)
batch_kwargs = {"datasource": "my_datasource", "dataset": my_dataset}
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="warning")
(and no generator is needed)