
I am currently trying to use the Python data validation package Great Expectations.

I am using the GreatExpectationsOperator to run an expectation suite against a particular datasource (a PostgreSQL datasource).

my_ge_task = GreatExpectationsOperator(
    task_id='my_task',
    expectation_suite_name='suite.error',
    batch_kwargs={
        'table': 'data_quality',
        'datasource': 'data_quality_datasource',
        'query': "SELECT * FROM data_quality WHERE batch='abc';"
    },
    data_context_root_dir=ge_root_dir
)

What I'm trying to figure out is how to store and retrieve my datasource credentials. For other operations against PostgreSQL, I store the database credentials in an Airflow PostgreSQL connection and use the PostgreSQL hook to interact with the database. With Great Expectations, however, the PostgreSQL connection details are stored inside the Great Expectations context, in config_variables.yaml. I have tried setting environment variables in my Dockerfile and using those as the credentials, and it works, but I am looking for a cleaner approach, ideally reusing my existing PostgreSQL connection details for the datasource.
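For context, Great Expectations supports `${...}` environment-variable substitution inside config_variables.yaml, which is how the Dockerfile env-variable approach wires in. A sketch of what such an entry might look like (the `my_postgres_db` variable name and the `POSTGRES_*` environment variable names are placeholders, not anything prescribed by the library):

```yaml
# great_expectations/uncommitted/config_variables.yaml
# Values here can reference environment variables; they are
# substituted when the data context loads the config.
my_postgres_db:        # hypothetical name, referenced from great_expectations.yml
  drivername: postgresql+psycopg2
  host: ${POSTGRES_HOST}
  port: ${POSTGRES_PORT}
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: ${POSTGRES_DB}
```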

There doesn't seem to be much detail online on how to accomplish this, so any help would be very much appreciated.

Thanks,

adan11
    You should check the [connections](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html) from the doc. That's the way to use credentials. – vdolez Aug 18 '21 at 09:20

1 Answer


One possible workaround is to run the GreatExpectationsOperator inside a PythonOperator, so that before GE runs, the script extracts the connection data from the Airflow connection and saves it as an environment variable.

Something like that:

import os

from airflow.hooks.base import BaseHook

from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

def get_ge_runner(task_id, checkpoint_name, connection_name):
    def run_ge(ds, **kwargs):
        # Pull the credentials from the Airflow connection and expose them
        # as the environment variable referenced in config_variables.yaml.
        connection = BaseHook.get_connection(connection_name)
        os.environ["my_db_creds"] = (
            f"postgresql+psycopg2://{connection.login}:{connection.password}"
            f"@{connection.host}:{connection.port}/{connection.schema}"
        )
        op = GreatExpectationsOperator(
            task_id=task_id,
            data_context_root_dir=ge_root_dir,
            run_name=task_id,
            checkpoint_name=checkpoint_name,
        )
        op.execute(kwargs)

    return run_ge
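The URL being built follows SQLAlchemy's `postgresql+psycopg2://user:password@host:port/database` format. A minimal standalone sketch of that construction, using a stand-in object in place of what `BaseHook.get_connection()` would return (the attribute names match Airflow's Connection model; the credential values are made up):

```python
from types import SimpleNamespace

def build_postgres_url(conn):
    """Build a SQLAlchemy-style PostgreSQL URL from connection attributes."""
    return (
        f"postgresql+psycopg2://{conn.login}:{conn.password}"
        f"@{conn.host}:{conn.port}/{conn.schema}"
    )

# Stand-in for an Airflow Connection object.
conn = SimpleNamespace(
    login="ge_user", password="secret",
    host="localhost", port=5432, schema="data_quality",
)
print(build_postgres_url(conn))
# → postgresql+psycopg2://ge_user:secret@localhost:5432/data_quality
```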
Ramil Gataullin