Questions tagged [great-expectations]

Great Expectations is open source software that helps teams promote analytic integrity by offering a unique approach to data pipeline testing. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time). Pipeline tests are like unit tests for datasets: they help you guard against upstream data changes and monitor data quality. In addition to pipeline testing, GE also provides data documentation and profiling.

131 questions
1
vote
0 answers

Validating datasets produced by identical Apache Airflow workflows

I have the same workflow on two different environments. To validate that both workflows are identical, I feed the same input data to both. If they are identical, I expect the output dataset of each workflow to be the same. In this…
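A minimal way to assert that the two outputs match, assuming both land as files and share a key column (the paths and key name below are hypothetical), is an order-insensitive pandas comparison:

    import pandas as pd

    # Hypothetical output locations for the two environments
    df_a = pd.read_parquet("env_a/output.parquet")
    df_b = pd.read_parquet("env_b/output.parquet")

    # Sort on a shared key so row order produced by either workflow doesn't matter
    key = "id"  # hypothetical join key
    pd.testing.assert_frame_equal(
        df_a.sort_values(key).reset_index(drop=True),
        df_b.sort_values(key).reset_index(drop=True),
    )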
1
vote
1 answer

Use Great Expectations to validate pandas DataFrame with existing suite JSON

I'm using the Great Expectations python package (version 0.14.10) to validate some data. I've already followed the provided tutorials and created a great_expectations.yml in the local ./great_expectations folder. I've also created a great…
Jed • 1,823 • 4 • 20 • 52
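For reference, in the 0.14.x API a DataFrame can be validated directly against a suite JSON file; this is a sketch, and the file paths shown are assumptions:

    import pandas as pd
    import great_expectations as ge

    df = pd.read_csv("data.csv")  # hypothetical input
    ge_df = ge.from_pandas(df)    # wrap as a PandasDataset

    # validate() accepts either an ExpectationSuite object or a
    # path to a suite JSON file
    results = ge_df.validate(
        expectation_suite="great_expectations/expectations/my_suite.json"
    )
    print(results.success)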
1
vote
1 answer

How do I pass multiple CSVs with a custom delimiter to a great_expectations checkpoint

I am trying to run a great_expectations checkpoint on 10 CSV files with a "|" delimiter. Currently, I have to specify this all in a YAML file, and only after converting my files from the "|" delimiter to ",". How can I run this for multiple files without…
Ravi • 35 • 9
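One way to avoid the pre-conversion step is a RuntimeBatchRequest per file, with batch_spec_passthrough forwarding the delimiter to the pandas reader; the datasource, connector, suite, and checkpoint names below are assumptions:

    import great_expectations as ge
    from great_expectations.core.batch import RuntimeBatchRequest

    context = ge.get_context()
    csv_paths = ["data/file1.csv", "data/file2.csv"]  # hypothetical paths

    for path in csv_paths:
        batch_request = RuntimeBatchRequest(
            datasource_name="my_pandas_datasource",
            data_connector_name="default_runtime_data_connector_name",
            data_asset_name=path,
            runtime_parameters={"path": path},
            batch_identifiers={"default_identifier_name": path},
            # Forwarded to the pandas CSV reader: keeps the "|" delimiter
            batch_spec_passthrough={"reader_options": {"sep": "|"}},
        )
        result = context.run_checkpoint(
            checkpoint_name="my_checkpoint",
            validations=[{
                "batch_request": batch_request,
                "expectation_suite_name": "my_suite",
            }],
        )
        print(path, result["success"])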
1
vote
0 answers

Can Great Expectations segregate good and bad records?

I am using Great Expectations in my ETL data pipeline for a POC. I have a validation which is failing (as expected), and I have the following data in my validation JSON: "unexpected_count": 205, "unexpected_percent": 10.25, …
Kuwali • 233 • 3 • 13
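GE does not split the data itself, but with pandas the result_format="COMPLETE" option returns the failing row indices, which is enough to segregate records; the column and allowed values below are hypothetical:

    import pandas as pd
    import great_expectations as ge

    df = pd.read_csv("orders.csv")  # hypothetical data
    ge_df = ge.from_pandas(df)

    # COMPLETE result_format includes the indices of unexpected rows
    result = ge_df.expect_column_values_to_be_in_set(
        "status", ["NEW", "SHIPPED", "DELIVERED"], result_format="COMPLETE"
    )
    bad_idx = result.result["unexpected_index_list"]
    bad_rows = df.loc[bad_idx]          # e.g. the 205 unexpected records
    good_rows = df.drop(index=bad_idx)  # everything that passed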
1
vote
0 answers

great_expectations add checkpoint with batch_spec_passthrough

In great_expectations, I am trying to add a checkpoint to a context. The batch of data refers to a CSV file stored on S3 with a semicolon as separator. I am loading the batch using PySpark as the connector. I tried the following code: First I…
aprospero • 529 • 3 • 14
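A sketch of the add_checkpoint call, assuming a Spark datasource is already configured (all names are placeholders); batch_spec_passthrough forwards the separator to the Spark CSV reader:

    import great_expectations as ge

    context = ge.get_context()

    batch_request = {
        "datasource_name": "my_spark_s3_datasource",
        "data_connector_name": "default_inferred_data_connector_name",
        "data_asset_name": "my_prefix/my_file.csv",
        # Passed through to spark.read: keeps the ";" separator
        "batch_spec_passthrough": {
            "reader_method": "csv",
            "reader_options": {"header": True, "sep": ";"},
        },
    }

    context.add_checkpoint(
        name="s3_semicolon_checkpoint",
        config_version=1.0,
        class_name="SimpleCheckpoint",
        validations=[{
            "batch_request": batch_request,
            "expectation_suite_name": "my_suite",
        }],
    )
    result = context.run_checkpoint(checkpoint_name="s3_semicolon_checkpoint")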
1
vote
1 answer

Unable to initialize Snowflake data source

I am trying to access a Snowflake datasource using the "great_expectations" library. The following is what I have tried so far: from ruamel import yaml import great_expectations as ge from great_expectations.core.batch import BatchRequest,…
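The usual v3-API pattern looks like the following sketch; the connection string fields are placeholders, and snowflake-sqlalchemy must be installed:

    from ruamel import yaml
    import great_expectations as ge

    context = ge.get_context()

    datasource_yaml = """
    name: my_snowflake_datasource
    class_name: Datasource
    execution_engine:
      class_name: SqlAlchemyExecutionEngine
      connection_string: snowflake://<USER>:<PASSWORD>@<ACCOUNT>/<DB>/<SCHEMA>?warehouse=<WH>&role=<ROLE>
    data_connectors:
      default_inferred_data_connector_name:
        class_name: InferredAssetSqlDataConnector
        include_schema_name: true
    """
    context.test_yaml_config(datasource_yaml)  # dry-run the config first
    context.add_datasource(**yaml.safe_load(datasource_yaml))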
1
vote
0 answers

Great Expectations validation result operations

Is there a way to split a batch of data into two streams: one for which the expectations are met, and one for which the expectations fail? That is, to split the tested batch into two tables/pandas DataFrames, one that is clean and…
MariaMadalina • 479 • 6 • 20
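Building on COMPLETE-format results, the failing indices of all column-map expectations can be collected and used to split the frame; the suite path below is hypothetical:

    import pandas as pd
    import great_expectations as ge

    df = pd.read_csv("batch.csv")  # hypothetical batch
    ge_df = ge.from_pandas(df)

    results = ge_df.validate(
        expectation_suite="great_expectations/expectations/my_suite.json",
        result_format="COMPLETE",
    )

    # Union of failing row indices across all column-map expectations
    bad_idx = set()
    for r in results.results:
        bad_idx.update(r.result.get("unexpected_index_list", []))

    failed_rows = df.loc[sorted(bad_idx)]      # stream that needs attention
    clean_rows = df.drop(index=list(bad_idx))  # stream that passed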
1
vote
0 answers

great_expectations data validation on Cassandra

I have multiple tables in a Cassandra keyspace. I want to use Great Expectations to validate my data. I've been trying to use Spark to load data from Cassandra, and I was able to create a RuntimeBatchRequest using Spark DataFrames. However, I need to…
alit8 • 41 • 1 • 3
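A sketch of the Spark route; the keyspace, table, connection host, and datasource names are placeholders, and the spark-cassandra-connector package must be on the classpath:

    from pyspark.sql import SparkSession
    import great_expectations as ge
    from great_expectations.core.batch import RuntimeBatchRequest

    spark = (
        SparkSession.builder
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .getOrCreate()
    )
    df = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace="my_keyspace", table="my_table")
        .load()
    )

    context = ge.get_context()
    batch_request = RuntimeBatchRequest(
        datasource_name="my_spark_datasource",
        data_connector_name="default_runtime_data_connector_name",
        data_asset_name="my_keyspace.my_table",
        runtime_parameters={"batch_data": df},  # pass the DataFrame directly
        batch_identifiers={"default_identifier_name": "cassandra_load"},
    )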
1
vote
1 answer

Airflow - Great Expectations - Getting/Setting config variables

I am currently trying to use the Python data validation package 'Great Expectations'. I am using the GreatExpectationsOperator to call an expectation suite on a particular datasource (a PostgreSQL datasource). my_ge_task =…
adan11 • 647 • 1 • 7 • 24
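For context, the operator ships in the airflow-provider-great-expectations package; a sketch of the older batch_kwargs-style call follows, where the table, datasource, and path names are placeholders and parameters vary across provider versions:

    from great_expectations_provider.operators.great_expectations import (
        GreatExpectationsOperator,
    )

    my_ge_task = GreatExpectationsOperator(
        task_id="validate_orders",
        data_context_root_dir="/usr/local/airflow/great_expectations",
        expectation_suite_name="orders.warning",
        batch_kwargs={
            "table": "orders",                       # placeholder table
            "datasource": "my_postgres_datasource",  # placeholder datasource
        },
    )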
1
vote
2 answers

How to create a Python wheel or determine what modules/libraries are within a Python wheel

I am trying to create a Python wheel for Great_Expectations. The .whl provided by Great_Expectations exists here: https://pypi.org/project/great-expectations/#files - great-expectations 0.13.25. Unfortunately, it appears that this .whl doesn't…
Patterson • 1,927 • 1 • 19 • 56
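Two facts help here: a wheel never bundles its dependencies, and a .whl is just a zip archive, so its contents can be listed directly. The filename below assumes the wheel was downloaded from the PyPI page above:

    import zipfile

    whl = "great_expectations-0.13.25-py3-none-any.whl"
    with zipfile.ZipFile(whl) as zf:
        # Top-level packages and the *.dist-info metadata in this wheel
        top_level = sorted({name.split("/")[0] for name in zf.namelist()})
        print(top_level)

To gather the dependency wheels as well, running pip download great-expectations==0.13.25 -d wheels/ fetches the whole dependency closure.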
1
vote
0 answers

great_expectations and scrapy

When I use great_expectations and scrapy in the same project, the two libraries seem to conflict. When I uninstall either of them everything works fine, but with both installed I get errors. Here is my stack trace, but I cannot…
Ben Muller • 221 • 1 • 4 • 10
1
vote
1 answer

How to pass a CustomDataAsset to a DataContext to run custom expectations on a batch?

I have a CustomPandasDataset with a custom expectation: from great_expectations.data_asset import DataAsset from great_expectations.dataset import PandasDataset from datetime import date, datetime, timedelta class…
Miguel Trejo • 5,913 • 5 • 24 • 49
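The classic v2-API shape for such a class looks like the sketch below; the expectation body is a hypothetical example:

    import pandas as pd
    from great_expectations.dataset import PandasDataset, MetaPandasDataset

    class CustomPandasDataset(PandasDataset):
        _data_asset_type = "CustomPandasDataset"

        @MetaPandasDataset.column_map_expectation
        def expect_column_values_to_be_recent(self, column):
            # Hypothetical rule: values parse as dates within the last 30 days
            cutoff = pd.Timestamp.today() - pd.Timedelta(days=30)
            return pd.to_datetime(column, errors="coerce") >= cutoff

    # Direct usage; wiring the class into a DataContext additionally requires
    # pointing data_asset_type at this module/class in the datasource config
    df = CustomPandasDataset({"created_at": ["2099-01-01"]})
    print(df.expect_column_values_to_be_recent("created_at").success)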
1
vote
1 answer

How to import a Great Expectations custom datasource: ValueError: no package specified for (required for relative module names)

I have this folder structure for my Great Expectations project:
great_expectations/
    dataset/
        __init__.py
        oracle_dataset.py
    datasource/
        __init__.py
        oracle_datasource.py
…
Pierre Delecto • 455 • 1 • 7 • 26
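The error itself comes from Python's import machinery (importlib.util.resolve_name), which refuses a leading-dot module name without a package anchor; a minimal reproduction and the shape of the fix:

    import importlib.util

    try:
        importlib.util.resolve_name(".oracle_datasource", package=None)
    except ValueError as err:
        print(err)  # no package specified for '.oracle_datasource' ...

    # The fix in great_expectations.yml is an absolute module path, e.g.
    #   module_name: datasource.oracle_datasource
    print(importlib.util.resolve_name("datasource.oracle_datasource", None))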
1
vote
1 answer

How to access the output folder from a PythonScriptStep?

I'm new to azure-ml and have been tasked with making some integration tests for a couple of pipeline steps. I have prepared some input test data and some expected output data, which I store on a 'test_datastore'. The following example code is a…
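In the v1 azureml SDK the usual mechanism is a PipelineData (or OutputFileDatasetConfig) output passed to the step as an argument; a sketch with placeholder names, reusing the 'test_datastore' from the question:

    from azureml.core import Workspace, Datastore
    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import PythonScriptStep

    ws = Workspace.from_config()
    datastore = Datastore.get(ws, "test_datastore")

    # A writable output folder whose concrete path is injected at run time
    output_dir = PipelineData("test_output", datastore=datastore)

    step = PythonScriptStep(
        name="integration_test_step",
        script_name="my_step.py",      # hypothetical script
        arguments=["--output-dir", output_dir],
        outputs=[output_dir],
        compute_target="cpu-cluster",  # placeholder compute
        source_directory="./src",
    )
    # Inside my_step.py, read --output-dir with argparse and write results there.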
1
vote
1 answer

Unable to set up a data source as AWS S3 via the CLI and test_yaml_config in great_expectations

great_expectations setup:
- Created a new virtual environment
- Installed required packages: pip install boto3, pip install fsspec, pip install s3fs
- Updated the data source in the configuration (great_expectations.yml): datasources: pandas_s3: class_name:…
Mohanraj N • 11 • 1
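For comparison, a test_yaml_config call that typically works for a pandas-on-S3 datasource; the bucket, prefix, and regex below are placeholders:

    import great_expectations as ge

    context = ge.get_context()

    datasource_yaml = """
    name: pandas_s3
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      default_inferred_data_connector_name:
        class_name: InferredAssetS3DataConnector
        bucket: my-bucket
        prefix: data/
        default_regex:
          pattern: (.*)\\.csv
          group_names:
            - data_asset_name
    """
    context.test_yaml_config(yaml_config=datasource_yaml)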