
I'm auditing some financial data between two transaction systems. The two datasets are lists of transactions, and each transaction has:

  • a timestamp,
  • idiosyncratic transaction IDs (not shared between the systems),
  • transaction values (item names, item values, totals, adjustments).

The data is imperfect: roughly 10-20% of the entries don't align cleanly between the two systems.

I've cobbled together a solution using Pandas (see my earlier question: What's a more efficient way to merge rows from DataFrames row-by-row with conditions?). A simple pandas.merge() is too greedy and produces many erroneous matches; a good match has to satisfy several criteria at once across the datasets.
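
To illustrate the greediness on toy data (`ts` and `total` are stand-ins for my real columns): merging on the timestamp alone cross-joins every pair of rows that share a timestamp, while an inner merge silently drops the rows that should have matched approximately.

```python
import pandas as pd

# Two toy transaction logs; 'ts' and 'total' stand in for my real schema.
a = pd.DataFrame({
    "ts": pd.to_datetime(["2017-03-01 10:00", "2017-03-01 10:00", "2017-03-01 11:30"]),
    "a_id": ["A1", "A2", "A3"],
    "total": [10.00, 25.00, 5.50],
})
b = pd.DataFrame({
    "ts": pd.to_datetime(["2017-03-01 10:00", "2017-03-01 10:00", "2017-03-01 12:00"]),
    "b_id": ["B7", "B8", "B9"],
    "total": [10.00, 25.00, 5.50],
})

# Merging on timestamp alone: every A row at 10:00 pairs with every B row
# at 10:00, so two real transactions become four candidate matches, and the
# 11:30/12:00 pair (a plausible approximate match) is dropped entirely.
merged = pd.merge(a, b, on="ts", suffixes=("_a", "_b"))
print(merged)  # 4 rows for 2 true matches; the third pair is gone
```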

What are some alternatives to Pandas? As I researched the problem, I found that some Pandas answers suggest operations that make me worry about dropping or duplicating data.

To put it into the form of a better Stack Overflow question: what are some popular algorithms or data structures for matching activity between two separate systems? Specifically, I'm looking for:

  • something that doesn't mutate the original datasets,
  • won't drop items from the original datasets,
  • works under the assumption that each entry has a match which is either present in or missing from the other dataset.
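
For concreteness, here's a minimal sketch of the behavior I'm after (the column names `ts` and `total` are stand-ins for my schema, and greedy nearest-timestamp matching is just one possible policy): score candidate pairs, claim the best ones one-to-one, and return the unmatched rows from both sides instead of dropping them.

```python
import pandas as pd

def match_transactions(a: pd.DataFrame, b: pd.DataFrame,
                       max_gap: pd.Timedelta = pd.Timedelta("30min")):
    """Greedy one-to-one matching on timestamp proximity plus equal totals.

    Returns matched index pairs plus the unmatched rows from each side;
    neither input frame is modified.
    """
    # Enumerate every candidate pair; O(n*m), fine for an illustration.
    candidates = []
    for i, ra in a.iterrows():
        for j, rb in b.iterrows():
            gap = abs(ra["ts"] - rb["ts"])
            if gap <= max_gap and ra["total"] == rb["total"]:
                candidates.append((gap, i, j))
    candidates.sort()  # closest timestamps claim their match first

    used_a, used_b, pairs = set(), set(), []
    for gap, i, j in candidates:
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)

    # drop() returns copies, so the originals stay intact.
    unmatched_a = a.drop(index=list(used_a))
    unmatched_b = b.drop(index=list(used_b))
    return pairs, unmatched_a, unmatched_b
```
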
  • `merge_asof()` does a merge on "close" matches, and can either match to the nearest greater or less than. Have you tried that? – elPastor Mar 19 '17 at 02:57
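
For reference, `merge_asof()` does an ordered, nearest-key join and takes a `tolerance` bound; a minimal sketch on toy frames (column names assumed, as above):

```python
import pandas as pd

a = pd.DataFrame({"ts": pd.to_datetime(["2017-03-01 10:00", "2017-03-01 11:30"]),
                  "a_id": ["A1", "A3"], "total": [10.00, 5.50]})
b = pd.DataFrame({"ts": pd.to_datetime(["2017-03-01 10:05", "2017-03-01 13:00"]),
                  "b_id": ["B7", "B9"], "total": [10.00, 5.50]})

# Both frames must be sorted on the key; tolerance bounds the allowed gap.
aligned = pd.merge_asof(a.sort_values("ts"), b.sort_values("ts"),
                        on="ts", direction="nearest",
                        tolerance=pd.Timedelta("30min"),
                        suffixes=("_a", "_b"))
print(aligned)  # A1 pairs with B7; A3 keeps NaN B columns rather than being dropped
```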

0 Answers