I'm auditing financial data between two transaction systems. The two datasets are lists of transactions, and each transaction has:
- a timestamp,
- an idiosyncratic transaction ID (not shared between systems),
- transaction values (item names, item values, totals, adjustments).
The data is imperfect: roughly 10-20% of the entries don't align cleanly between the two systems.
I've cobbled together a solution to the problem using Pandas: "What's a more efficient way to merge rows from DataFrames row-by-row with conditions?"
A simple pandas.merge() is too greedy and produces many erroneous matches; a good match has to satisfy several criteria at once, not just a single shared key.
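Here's a minimal made-up example of what I mean by "greedy" (the `sys_a`/`sys_b` frames and column names are placeholders, not my real schema):

```python
import pandas as pd

# Toy data (made-up): note the duplicated total of 10.00 in each system.
sys_a = pd.DataFrame({
    "a_id": ["A1", "A2", "A3"],
    "timestamp": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 09:05",
                                 "2024-01-01 17:30"]),
    "total": [10.00, 10.00, 42.50],
})
sys_b = pd.DataFrame({
    "b_id": ["B7", "B8", "B9"],
    "timestamp": pd.to_datetime(["2024-01-01 09:01", "2024-01-01 09:06",
                                 "2024-01-01 17:31"]),
    "total": [10.00, 10.00, 42.50],
})

# Merging on total alone is many-to-many: both 10.00 rows in sys_a match
# both 10.00 rows in sys_b, producing 4 candidate pairs where only 2 are real.
matched = sys_a.merge(sys_b, on="total", suffixes=("_a", "_b"))
print(matched[["a_id", "b_id", "total"]])
```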
What are some alternatives to using Pandas? While researching the problem, I found Pandas-based answers suggesting operations that make me worry about silently dropping or duplicating rows.
To put it in the form of a better Stack Overflow question: what are some popular algorithms or data structures useful for matching activity between two separate systems? Ideally something that (a stripped-down sketch of what I'm after follows this list):
- doesn't mutate the original datasets,
- won't drop items from the original datasets,
- works under the assumption that each entry has a match which is either present in, or missing from, the other dataset.
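For concreteness, here's a stripped-down sketch of the kind of matching I'm after, reusing the toy columns from the example above (`a_id`, `b_id`, `timestamp`, `total` are placeholders, and the 5-minute tolerance is arbitrary). It blocks on the exact total, breaks ties by nearest timestamp, consumes each candidate at most once, and reports unmatched rows instead of dropping them:

```python
import pandas as pd

def match_transactions(sys_a: pd.DataFrame, sys_b: pd.DataFrame,
                       tolerance: pd.Timedelta = pd.Timedelta("5min")) -> pd.DataFrame:
    """Pair rows of sys_a with rows of sys_b without mutating either input.

    Returns one row per sys_a entry: (a_id, b_id), with b_id = None when
    no acceptable match exists, so nothing is silently dropped.
    """
    used = set()   # b_ids already claimed, so each candidate matches at most once
    pairs = []
    for row in sys_a.sort_values("timestamp").itertuples():
        # Block on the exact total, excluding already-claimed candidates.
        block = sys_b[(sys_b["total"] == row.total) & ~sys_b["b_id"].isin(used)]
        if block.empty:
            pairs.append((row.a_id, None))
            continue
        # Within the block, prefer the nearest timestamp...
        deltas = (block["timestamp"] - row.timestamp).abs()
        best = deltas.idxmin()
        # ...but only accept it if it falls inside the tolerance window.
        if deltas.loc[best] <= tolerance:
            b_id = block.loc[best, "b_id"]
            used.add(b_id)
            pairs.append((row.a_id, b_id))
        else:
            pairs.append((row.a_id, None))
    return pd.DataFrame(pairs, columns=["a_id", "b_id"])
```

With the toy frames above, `match_transactions(sys_a, sys_b)` pairs A1↔B7 and A2↔B8 instead of the four-way fan-out the plain merge produced. But it's O(n·m) row-by-row Python, which is what I'm hoping a better algorithm or data structure can replace.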