I am running the Dedupe package on large datasets (4 million records, 5 fields) with the following objectives:
- Deduplicate ~3.5 million records
- Record-link an incremental batch of ~100K records against ~1.1 million records
Note: everything is held in memory on Spark and DBFS.
- I was able to run end-to-end dedupe on 60K records.
- For 100K records the program hangs in the Dedupe.cluster() method, and I get a warning about components being capped at 30K nodes.
Summary of steps:
- Block the data (build the blocking indexes)
- pairs(data): produces ~3.5 million candidate pairs for 100K records
- score(pairs): works fine; tested with 2 million input records and scoring behaved as expected
- Dedupe.cluster(score(pairs)): hangs with the warning below whenever I pass in more than 60K records; a minimal sketch of these calls follows
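For reference, a minimal sketch of the calls I am making (assuming the dedupe 2.x API and a previously trained StaticDedupe model; the settings file name, the collect_records() helper, and the threshold value are placeholders, not my exact code):

```python
import dedupe

# Load a previously trained model; the settings file name here is a placeholder.
with open("dedupe_learned_settings", "rb") as settings_file:
    deduper = dedupe.StaticDedupe(settings_file)

# data_d: {record_id: {field_name: value, ...}} pulled from Spark/DBFS into
# driver memory. collect_records() is a hypothetical helper, not part of dedupe.
data_d = collect_records()

# Blocking + candidate pair generation (~3.5 million pairs for 100K records).
pairs = deduper.pairs(data_d)

# Scoring works as expected, even on the larger inputs.
scores = deduper.score(pairs)

# Clustering: this is where the run hangs once the input passes ~60K records
# and the "component contained 89927 elements" warning starts repeating.
clusters = deduper.cluster(scores, threshold=0.5)

for record_ids, confidences in clusters:
    # Each cluster comes back as a tuple of record ids with per-record scores.
    pass
```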
Kindly suggest any pointers or big-data examples I can refer to. MySQL is currently not the primary plan.
Warning: "3730000 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0"