
I've been reading about anchor modeling and really like the concept. My hope is to incorporate it into a data management framework where I consolidate multiple data sources into an anchor model, then either expose that model directly or use it to feed data marts for our data scientists.

But I'm not sure how to approach entity resolution. The guidelines state no updates, only inserts, with deletes allowed only to remove erroneous data. Now let's say my source system(s) contain duplicate entities (e.g., John Smith appears more than once), and those duplicates make their way into my anchor model. What is the best way to clean this up?

My rubber duck is telling me to create an entity resolution layer on top of my anchor model that looks for these issues and corrects them. Correcting would mean merging entities in anchors and fixing the affected ties accordingly. But then I'm updating my anchor model... which is against best practices.

Or am I looking at this wrong, and entity resolution should be dealt with before data gets into the anchor model? Mistakes can still happen, though, and it would be nice to know I could address the issue inside the anchor model should it present itself.
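To make the dilemma concrete, here is a minimal Python sketch (all table names and ids are invented, not from any real schema) of the insert-only alternative: instead of merging anchor rows, each resolution decision is recorded as a new "same-as" tie row, and duplicates are collapsed to a canonical anchor id at read time with a small union-find.

```python
# Hypothetical sketch: insert-only "same-as" ties instead of merging anchors.
# Anchor, attribute, and tie "tables" are plain Python structures here.

person_anchor = {1, 2, 3}  # surrogate keys: inserts only, never updated
person_name = {1: "John Smith", 2: "Jonny Smith", 3: "Jane Doe"}

# Each entity-resolution decision is just another insert into this tie.
same_as_tie = [(1, 2)]  # an ER scan decided anchors 1 and 2 are the same person

def canonical(anchor_id, ties):
    """Resolve an anchor id to its canonical representative via union-find."""
    parent = {}

    def find(x):
        while parent.get(x, x) != x:
            x = parent[x]
        return x

    for a, b in ties:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # arbitrary rule: lowest id wins
    return find(anchor_id)

print(canonical(2, same_as_tie))  # anchor 2 resolves to canonical anchor 1
```

The anchors, ties, and attributes are never modified; if a resolution later turns out to be wrong, the tie row can be deleted as erroneous data, which stays within the guidelines.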

Pickle
  • Had a few days to think about this. I'm thinking that I will be loading data into the anchor model, but then running an entity resolution scan separately after load. When an entity has been resolved, I'll save this as an attribute table connected to the associated anchor. For example, if I discover John Smith is the same person as Jonny Smith, then I'll connect them in an ER attribute using some kind of batch key. – Pickle Feb 13 '23 at 15:54
  • So that's good... but then do I just leave it at that? Or do I get more proactive with the ER attribute table and use it to merge Person records in the anchor table? The problem is that I'd need to modify more than just the anchor table: any connected ties and other attribute tables would also need modification. – Pickle Feb 13 '23 at 15:58

0 Answers