
For a custom MDM (Master Data Management) solution, each record has to be compared with every other record (effectively a cross join) to find match candidates, i.e., with N records and M columns the worst case requires on the order of N x N x M comparisons. Moreover, if each column comparison uses a fuzzy algorithm (e.g., Levenshtein distance with complexity O(p·q), where p and q are the lengths of the two strings), the overall runtime cost grows dramatically.
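To illustrate, here is a minimal sketch of that naive all-pairs approach in PySpark; the column names (`id`, `name`, `state`) and the distance threshold are hypothetical, and `F.levenshtein` is the built-in Spark SQL function:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy input; real data would have N records and M columns to compare.
records = spark.createDataFrame(
    [(1, "Acme Corp", "NY"), (2, "Acme Corporation", "NY"), (3, "Globex", "CA")],
    ["id", "name", "state"],
)

# Cross join every record against every other: O(N^2) candidate pairs,
# each paying the fuzzy-comparison cost per column.
pairs = (
    records.alias("l")
           .crossJoin(records.alias("r"))
           .where(F.col("l.id") < F.col("r.id"))  # skip self and duplicate pairs
           .withColumn("name_dist", F.levenshtein(F.col("l.name"), F.col("r.name")))
)

matches = pairs.where(F.col("name_dist") <= 3)  # hypothetical match threshold
matches.show()
```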

A few options have already been considered: if certain combinations are known to be logical mismatches, those comparisons can be skipped, e.g., when matching on address, there is no need to compare cross-state or cross-country records because they are logical mismatches (see the blocking sketch below).
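A minimal sketch of that idea (essentially blocking), reusing the `records` DataFrame from the sketch above: join on a blocking key (`state` here, hypothetical) instead of doing a full cross join, so only records within the same block are compared.

```python
from pyspark.sql import functions as F

# Equi-join on the blocking key restricts pairs to the same state,
# cutting the candidate set from O(N^2) to the sum of squared block sizes.
blocked_pairs = (
    records.alias("l")
           .join(records.alias("r"), F.col("l.state") == F.col("r.state"))
           .where(F.col("l.id") < F.col("r.id"))
           .withColumn("name_dist", F.levenshtein(F.col("l.name"), F.col("r.name")))
)

candidate_matches = blocked_pairs.where(F.col("name_dist") <= 3)  # hypothetical threshold
candidate_matches.show()
```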

What is the best way to write a custom MDM solution for these scenarios using PySpark (Databricks)?
