
I have the same workflow in two different environments. To validate that both workflows are identical, I feed the same input data to both. If they are identical, I expect the output dataset of each workflow to be the same.

For this requirement, I cannot alter the workflows in any way (add/remove DAGs, etc.).

Which tool is best suited for this use case? I have been reading up on data validation frameworks like Apache Griffin and Great Expectations. Can either of these be used here? Or is there a simpler alternative?

Update: I forgot to mention that I want the validation process to be as non-interactive as possible. The Great Expectations tutorial talks about manually opening and running Jupyter notebooks, and I want to minimize manual steps like that as much as possible. If that makes sense.

Update 2:

Dataset produced by workflow in first environment:

Name Value
ABC 10
DEF 20

Dataset produced by workflow in second environment:

Name Value
DEF 20
ABC 10

After running the validation, I want it to report that the two datasets are identical even though the rows are in a different order (roughly along the lines of the sketch below).
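
To be concrete, something like the following would do. This is only a rough sketch: I am assuming both workflows export plain CSV files, the file names are made up, and Pandas is just one possible tool for the comparison.

    import pandas as pd

    # Placeholder file names: in practice these would be wherever each
    # environment's workflow writes its output (local disk or a cloud bucket).
    df_env1 = pd.read_csv("output_env1.csv")
    df_env2 = pd.read_csv("output_env2.csv")

    def datasets_match(a, b):
        # Require the same set of columns (column order ignored).
        if sorted(a.columns) != sorted(b.columns):
            return False
        cols = sorted(a.columns)
        # Sorting by every column makes the comparison row-order independent
        # and also copes with non-unique values such as repeated Names.
        a_sorted = a[cols].sort_values(by=cols).reset_index(drop=True)
        b_sorted = b[cols].sort_values(by=cols).reset_index(drop=True)
        return a_sorted.equals(b_sorted)

    print("identical" if datasets_match(df_env1, df_env2) else "different")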

  • Without seeing your output data or the final data store, it is difficult to give an informed answer. However, based on your requirements, I would be inclined to put them into a Pandas DF and compare. – dimButTries Mar 19 '22 at 19:41
  • @dimButTries for this initial requirement, it will most probably be simple tabular data exported to CSV files stored locally on disk or in a cloud bucket. Will check out Pandas. Not entirely familiar with Python yet, but working on it. – user929287171 Mar 19 '22 at 19:46
  • @dimButTries please check Update 2 in my question for a simplified idea on the planned datasets. – user929287171 Mar 19 '22 at 19:52
  • Thank you for preparing an example. Is Name unique in both datasets? – dimButTries Mar 19 '22 at 20:03
  • @dimButTries Nope, Name is a non-unique column. – user929287171 Mar 19 '22 at 20:19
