
I have the same workflow in two different environments. To validate that both workflows are identical, I feed the same input data to both. If they are identical, I expect the output dataset of each workflow to be the same.

For this requirement, I cannot alter the workflows in any way (add/remove DAGs, etc.).

Which tool is best suited for this use case? I have been reading up on data validation frameworks like Apache Griffin and Great Expectations. Can either of these be used here? Or is there a simpler alternative?

Update: I forgot to mention that I want the validation process to be as non-interactive as possible. The Great Expectations tutorial talks about manually opening and running Jupyter notebooks, and I want to minimize manual steps like that as much as possible. If that makes sense.

Update 2:

Dataset produced by workflow in first environment:

Name Value
ABC 10
DEF 20

Dataset produced by workflow in second environment:

Name Value
DEF 20
ABC 10

After running the validation, I want it to report that the two datasets are identical even though the rows are in a different order (roughly along the lines of the sketch below).
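
To be concrete, something like the following would do. This is only a rough sketch: I am assuming both workflows export plain CSV files, the file names are made up, and Pandas is just one possible tool for the comparison.

    import pandas as pd

    # Placeholder file names: in practice these would be wherever each
    # environment's workflow writes its output (local disk or a cloud bucket).
    df_env1 = pd.read_csv("output_env1.csv")
    df_env2 = pd.read_csv("output_env2.csv")

    def datasets_match(a, b):
        # Require the same set of columns (column order ignored).
        if sorted(a.columns) != sorted(b.columns):
            return False
        cols = sorted(a.columns)
        # Sorting by every column makes the comparison row-order independent
        # and also copes with non-unique values such as repeated Names.
        a_sorted = a[cols].sort_values(by=cols).reset_index(drop=True)
        b_sorted = b[cols].sort_values(by=cols).reset_index(drop=True)
        return a_sorted.equals(b_sorted)

    print("identical" if datasets_match(df_env1, df_env2) else "different")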

  • Without seeing your output data or the final data store, it is difficult to give an informed answer. However, based on your requirements, I would be inclined to put them into a Pandas DF and compare. – dimButTries Mar 19 '22 at 19:41
  • @dimButTries for this initial requirement, it will most probably be simple tabular data exported to CSV files stored locally on disk or in a cloud bucket. Will check out Pandas. Not entirely familiar with Python yet, but working on it. – user929287171 Mar 19 '22 at 19:46
  • @dimButTries please check Update 2 in my question for a simplified idea on the planned datasets. – user929287171 Mar 19 '22 at 19:52
  • Thank you for preparing an example. Is Name unique in both datasets? – dimButTries Mar 19 '22 at 20:03
  • @dimButTries Nope, Name is a non-unique column. – user929287171 Mar 19 '22 at 20:19
