3

I am fairly new to Great Expectations - and have a question. Essentially I have a PostgreSQL database, and every time I run my data pipeline, i want to validate a specific subset of the PostgreSQL table based off some key. Eg: If the data pipeline is run every day, the would be a field called current_batch. And the validation would occur for the below query:

SELECT * FROM jobs WHERE current_batch = <input_batch>.

I am unsure the best way to complete this. I am a using v3-api of great expectations and am a bit confused as to whether to use a checkpoint, or a validator. I assume I want to use a checkpoint but I can't seem to figure out how to create a checkpoint, but then only validate a specific subset of the PostgreSQL datasource.

Any help or guidance would be much appreciated.

Thanks,

adan11
  • 647
  • 1
  • 7
  • 24

4 Answers4

1

I completely understand your confusion because I am working with GE too and the documentation is not really clear.

First of all "Validators" are now called "Checkpoints", so they are not a different entity, as you can read here.

I am working on an Oracle database and the only way I found to apply a query before testing my data with expectations is to put the query inside the checkpoint.

To create a checkpoint you should run the great_expectations checkpoint new command from your terminal. After creating it, you should add the "query" field inside the .yml file that is your checkpoint.

Below you can see a snippet of a checkpoint I am working with. When I want to validate my data, I run the command great_expectations checkpoint run check1

name: check1
module_name: great_expectations.checkpoint
class_name: LegacyCheckpoint
batches:
  - batch_kwargs:
      table: pso
      schema: test
      query: SELECT p AS c,

             [ ... ]

       AND    lsr = c)

      datasource: my_database
      data_asset_name: test.pso

    expectation_suite_names:
      - exp_suite1

Hope this helps! Feel free to ask if you have any doubts :)

Lorenzo
  • 180
  • 8
  • 1
    Isn't your answer based on V2 though? From what I see the way to initialize v3 Checkpoint is `great_expectations --v3-api checkpoint new my_checkpoint` – sirVir Nov 15 '21 at 11:55
  • 1
    Yes, you are righ, but it works the same way. I tried with both the commands and the results are identical. @sirVir – Lorenzo Nov 24 '21 at 10:37
1

for current GE V3 API your checkpoint config file with SQL SELECT might look like this, tested for GE v0.16.1: (example for snowflake, \n in SQL due to conversion from multiline python variable)

name: trial.sf_chkpoint
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template: Full
expectation_suite_name:
batch_request: {}
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
evaluation_parameters: {}
runtime_configuration: {}
validations:
  - batch_request:
      datasource_name: sflake_trial
      data_connector_name: default_runtime_data_connector_name
      data_asset_name: DW.table_name
      runtime_parameters:
        query: "\nSELECT * \nfrom testdb.public.customer_demographics \nLIMIT 10000\n"
      batch_identifiers:
        default_identifier_name: default_identifier
    expectation_suite_name: trial.sf
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:
alex
  • 11
  • 3
0

I managed this using Views (in Postgres). Before running GE, I create (or replace the existing) view as a query with all necessary joins, filtering, aggregations, etc. And then specify the name of this view in GE checkpoints.

Yes, it is not the ideal solution. I would rather use a query in checkpoints too. But as a workaround, it covers all my cases.

Let's have view like this:

CREATE OR REPLACE VIEW table_to_check_1_today AS
SELECT * FROM initial_table
WHERE dt = current_date;

And checkpoint be configured something like this:

name: my_task.my_check
config_version: 1.0
validations:
- expectation_suite_name: my_task.my_suite
  batch_request:
    datasource_name: my_datasource
    data_connector_name: default_inferred_data_connector_name
    data_asset_name: table_to_check_1_today

Ramil Gataullin
  • 131
  • 2
  • 5
0

Yes, a view can be created using the "current_date" - and the checkpoint can simply run the view. However, this would mean that the variable (current_date) is stored in the database - which may not be desirable; you might want to run the query in the checkpoint for a different date - which could be coming from a environment variable or elsewhere - to the CLI or python/notebook

Yet to find a solution where we can substitute a string in the checkpoint query; using a config variable from the file is a very static way - there may be different checkpoints running for different dates.