
My team has been thrown into the deep end: we've been asked to build a federated search of customers across a variety of large datasets, each holding varying amounts of differing data about each individual (and with no matching identifiers), and I was wondering how to go about implementing it.

I was thinking Apache NiFi would be a good fit: query our various databases, merge the results, deduplicate the entries via an external tool, and then push the result into a database which in turn feeds an Elasticsearch instance for the application's use.

So, roughly speaking, something like this:

[image: diagram of the proposed NiFi flow]

For example's sake, the following data then exists in the result database after the first flow:

[image: example customer rows in the result database]

Then I'd run https://github.com/dedupeio/dedupe over this database table, which would add cluster IDs to aid record linkage, e.g.:

[image: the same rows with cluster IDs added by dedupe]
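For context, a minimal sketch of the kind of dedupe run that could produce those cluster IDs (assuming the dedupe 2.x Python API; the field names, sample records and threshold are hypothetical):

```python
import dedupe

# Records keyed by row id, e.g. loaded from the result database (hypothetical fields/values)
data = {
    1: {"name": "John Smith", "dob": "1980-01-01", "postcode": "G1 1AA"},
    2: {"name": "Jon Smith",  "dob": "1980-01-01", "postcode": "G1 1AA"},
    3: {"name": "Jane Doe",   "dob": "1975-06-12", "postcode": "EH1 2BB"},
}

fields = [
    {"field": "name", "type": "String"},
    {"field": "dob", "type": "String"},
    {"field": "postcode", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)  # dedupe's interactive labelling/training step
deduper.train()

# partition() groups records into clusters; persist one cluster id per row
for cluster_id, (record_ids, scores) in enumerate(deduper.partition(data, threshold=0.5)):
    for record_id in record_ids:
        print(record_id, cluster_id)
```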

A second flow would then query the result database and feed the results into the Elasticsearch instance for use by the application's API, which would use the cluster ID to link the duplicates.
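To illustrate that last step, a rough sketch of how the API might use the cluster ID once the documents are in Elasticsearch (assuming the newer elasticsearch-py client; the index and field names are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Find a candidate customer by name, then pull every record that shares
# its cluster_id so the API can present them as one linked individual.
first = es.search(index="customers", query={"match": {"name": "John Smith"}})
cluster_id = first["hits"]["hits"][0]["_source"]["cluster_id"]

linked = es.search(index="customers", query={"term": {"cluster_id": cluster_id}})
for doc in linked["hits"]["hits"]:
    print(doc["_source"])
```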

A couple of questions:

  • How would I trigger dedupe to run once the merged content has been pushed to the database?


  • The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?


I also haven't considered any CDC process here; the source databases will be constantly updated, which I'd need to handle. So I'm really interested to hear whether anybody has solved a similar problem or used a different approach (happy to consider other technologies too).

Thanks!

Gavin Gilmour

1 Answer


For de-duplicating...

You will probably need to write a custom processor, or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing you'd write a script for ExecuteScript, unless there is an equivalent Java library.
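For what it's worth, a bare-bones sketch of what an ExecuteScript (Jython) body typically looks like; the de-duplication logic itself is left as a placeholder comment:

```python
# ExecuteScript (Jython) skeleton: read the flowfile content, transform it, write it back
import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class DedupeCallback(StreamCallback):
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        records = json.loads(text)
        # ... assign cluster ids here (e.g. call out to a de-duplication step) ...
        outputStream.write(bytearray(json.dumps(records).encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, DedupeCallback())
    session.transfer(flowFile, REL_SUCCESS)
```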

For triggering the second flow...

Do you need that intermediate DB table for something else?

If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.

If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

Bryan Bende
  • Hi Bryan, thanks for the reply. I don't particularly need the intermediate table; I was just put off using ExecuteScript by a talk I watched here https://www.youtube.com/watch?v=fblkgr1PJ0o&feature=youtu.be&t=3149 - I thought it'd be a bit more robust to store the results elsewhere and have an external tool work off them. The other thing is that dedupe has a manual user-processing step for training the data which would need to be done… Now wondering if I need a data lake, or if Kafka/Spark has a role to play in this. New to this tech and struggling to find examples in the wild! – Gavin Gilmour Aug 17 '18 at 13:57