I am running the Dedupe package on large datasets (4 million records, 5 fields) with the following objectives:
- Deduplicate ~3.5 million records
- Record-link an incremental batch of ~100K records against ~1.1 million records
Note: everything is held in memory on Spark and DBFS.
- I was able to run end-to-end dedupe on 60K records.
- For 100K records the program hangs in the Dedupe.cluster() method, and I get a warning about components being capped at 30K nodes.
Summary of steps:
- Block the data (build the blocking indexes)
- pairs(data): produces ~3.5 million candidate pairs for 100K records
- score(pairs): works fine; tested with 2 million input records and scoring behaved as expected
- Dedupe.cluster(score(pairs)): hangs with the warning below whenever I pass in more than 60K records; a minimal sketch of these calls follows
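For reference, a minimal sketch of the calls I am making (assuming the dedupe 2.x API and a previously trained StaticDedupe model; the settings file name, the collect_records() helper, and the threshold value are placeholders, not my exact code):

```python
import dedupe

# Load a previously trained model; the settings file name here is a placeholder.
with open("dedupe_learned_settings", "rb") as settings_file:
    deduper = dedupe.StaticDedupe(settings_file)

# data_d: {record_id: {field_name: value, ...}} pulled from Spark/DBFS into
# driver memory. collect_records() is a hypothetical helper, not part of dedupe.
data_d = collect_records()

# Blocking + candidate pair generation (~3.5 million pairs for 100K records).
pairs = deduper.pairs(data_d)

# Scoring works as expected, even on the larger inputs.
scores = deduper.score(pairs)

# Clustering: this is where the run hangs once the input passes ~60K records
# and the "component contained 89927 elements" warning starts repeating.
clusters = deduper.cluster(scores, threshold=0.5)

for record_ids, confidences in clusters:
    # Each cluster comes back as a tuple of record ids with per-record scores.
    pass
```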
Kindly suggest any pointers or big-data examples I can refer to. MySQL is currently not the primary plan.
Warning: "3730000 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0"