I would like to automate this processes for multiple datasets. I am not sure if there exists a tool like Alteryx or any software where I can do the following.
- Compare 2 similar datasets on a block aggregate level.
- If mismatch exists, zoom into that block to find where the inconsistencies occur.
- Get rows of data with ID where inconsistency happens. The goal is to explore on a high level because of the huge amount of data and open a more comprenhesive view of the inconsistency to point out the error easily.
Here is the data that exists in both SIT and PROD:
ID | Block | Amt A | Amt B | Amt C | .... |
---|---|---|---|---|---|
1 | X | 1000 | 400 | 1000 | .... |
2 | Y | 2000 | 500 | 4000 | .... |
3 | Z | 3000 | 500 | 3000 | .... |
2 | Y | 2000 | 3000 | 2000 | .... |
3 | Z | 3000 | 4500 | 3000 | .... |
- This is my PROD data after aggregating on 'block'
Block | Counts | Sum Amt A | Sum Amt B |
---|---|---|---|
X | 1000 | 40000 | 10000 |
Y | 2000 | 45000 | 20000 |
Z | 3000 | 50000 | 30000 |
- This is my SIT data after aggregating on 'block'
Block | Counts | Sum Amt A | Sum Amt B |
---|---|---|---|
X | 1000 | 40000 | 15000 |
Y | 2000 | 47000 | 20000 |
Z | 3000 | 50000 | 30000 |
FYI - There are multiple columns like block that could be used on aggregation. One could choose any column, get an aggregation to find inconsistencies and then explore into that particular block.