Questions tagged [python-dedupe]

Questions about the dedupe Python library (a library for probabilistic deduplication and record linkage)

Dedupe is an open-source Python library for probabilistic deduplication, record linkage, and entity resolution.

67 questions
1
vote
0 answers

Structuring dedupe results in a database

I am using the Python project dedupe to find duplicate organization names in my data. Many of the examples focus on how to process the data rather than on how the results are used. Are there any best practices for taking the results, putting…
Casey
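One possible way to persist clustering results, sketched here under stated assumptions rather than taken from any answer: it assumes the clustering step yields a list of `(record_ids, scores)` pairs, one per cluster, and writes one row per record into a mapping table. The table name `entity_map`, its columns, the helper `store_clusters`, and the sample IDs and scores are all hypothetical.

```python
import sqlite3

# Assumed shape of the clustering output: one (record_ids, scores) pair per
# cluster. The IDs and scores below are made up for illustration.
clusters = [
    (("org-1", "org-7"), (0.93, 0.93)),
    (("org-2", "org-4", "org-9"), (0.88, 0.91, 0.85)),
]


def store_clusters(conn, clusters):
    """Write one row per record into a hypothetical entity_map table."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS entity_map (
            record_id  TEXT PRIMARY KEY,
            cluster_id INTEGER NOT NULL,
            score      REAL NOT NULL
        )
        """
    )
    rows = [
        (record_id, cluster_id, float(score))
        for cluster_id, (record_ids, scores) in enumerate(clusters)
        for record_id, score in zip(record_ids, scores)
    ]
    conn.executemany("INSERT OR REPLACE INTO entity_map VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = store_clusters(conn, clusters)

# Look up every record that shares a cluster with "org-4".
same_entity = [
    row[0]
    for row in conn.execute(
        "SELECT record_id FROM entity_map "
        "WHERE cluster_id = (SELECT cluster_id FROM entity_map WHERE record_id = ?) "
        "ORDER BY record_id",
        ("org-4",),
    )
]
```

Keeping the mapping in its own table leaves the source rows untouched, so a later re-run can replace the mapping without rewriting the original data.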
1
vote
1 answer

How do you make a gazetteer for Dedupe when individuals have multiple addresses?

According to the DataMade Dedupe documentation, it seems like a gazetteer needs to have clean, distinct individual-level data. What do you do if the individual has moved, changed jobs, etc. a bunch of times? Include multiple observations per…
Luke
0
votes
0 answers

Output will not Export to Excel

My output will not export to Excel. I am not sure what I am missing in my Python script. I have installed the openpyxl package. The script works where I am merging two different datasets, but I am not able to access the output. import pandas as…
0
votes
0 answers

How to fix a KeyError for a variable not found in the index? (Python)

I am trying to fuzzy match two different datasets based on the name column in Python. I am getting an error saying that the name is not found in the index. The name variable is a column in both datasets. Can anyone provide me with any suggestions to…
0
votes
0 answers

Matching CSV files: KeyError message says label is not found

I am using Python record linkage and am trying to merge two CSV files by fuzzy matching on company name and state. While running the code, I get a KeyError message about a label not being found, and I do not understand what I need to do on my end to…
0
votes
1 answer

Installing Python packages on Mac

I am new to Python and I am trying to install a package in Visual Studio Code using Anaconda. I type in the following command: pip install pandas-dedupe. But I receive this error: /Users/nathang./anaconda3/bin/python…
0
votes
0 answers

How do I import a library when numpy has no attribute float?

I want to use the dedupe library, but I can't import it. I've installed it via 'pip install dedupe', but when I try to import 'pandas_dedupe' as shown in this video, I get the following error message: AttributeError: module 'numpy' has no attribute…
0
votes
0 answers

Difference between the Gazetteer and Linker in Dedupe for Streaming Data

I am using the dedupe package and I'm having trouble understanding the difference between the gazetteer and linker. I've read the documentation, but it seems a bit unclear to me. I'm already able to compute resolutions with the deduper class, but…
lnathan
0
votes
1 answer

PDF File dedupe issue with same content, but generated at different time periods from a docx

I am working on a PDF file dedupe project and have analyzed many Python libraries that read files, generate a hash value for each, and then compare it with the next file to detect duplicates, similar to the logic below or to Python's filecmp library. But the…
user1597990
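The hash-and-compare logic described in the excerpt can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the helper names `sha256_of_file`, `group_by_digest`, and `find_duplicate_files` are hypothetical. It only detects files that are byte-for-byte identical, so, as the question title suggests, two PDFs exported from the same docx at different times would not be grouped together.

```python
import hashlib
from collections import defaultdict


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file's raw bytes in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(digests):
    """Given {name: hex_digest}, return sorted groups of names sharing a digest."""
    groups = defaultdict(list)
    for name, hex_digest in digests.items():
        groups[hex_digest].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def find_duplicate_files(paths):
    """Group file paths whose contents are byte-for-byte identical."""
    return group_by_digest({path: sha256_of_file(path) for path in paths})
```

Catching files with the same visible content but different bytes would need a different comparison key, for example a hash of the extracted text, which this sketch does not attempt.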
0
votes
0 answers

How to integrate Dedupe active learning functionality (console_label) with a RESTful API

The code below trains a model on CSV data and involves active learning. How can I integrate the console_label function with a RESTful API (e.g. FastAPI)? # Create a new deduper object and pass our data model to it. deduper = dedupe.Dedupe(fields) # If we have…
0
votes
1 answer

Python dedupe library for big data

I am running the Dedupe package on large datasets (4 million records / 5 fields) with the following objectives: (1) deduplicate records (3.5 million); (2) record-link incremental data (~100K) with ~1.1 million. Note: Everything is in memory on Spark…
0
votes
1 answer

Python3 match, reverse match and dedupe

The intention of the code below is to process the two dictionaries and add matching symbol values from each dictionary to the pairs list if the value contains the item in cur but not if the value contains either item in the curpair list. I'm…
Jason
0
votes
1 answer

How do I apply the findings of a Pandas GroupBy to the source data

I'm doing a name de-dupe using pandas_dedupe and have multiple steps. First, I train and de-dupe the source data: deDupedNames = dedupe_dataframe( sourceData, columnsOfInterest, config_name=configName) Next, I discard data sets where the cluster…
Monza
0
votes
1 answer

Is there a type in the Python dedupe library for cross-phone matching?

I'm using the Dedupe library to match person records to each other. My data includes first_name, last_name, email, phone1, phone2, phone3, and address information. Here is my question: I always want to match two records with 80% to 99% confidence if they…
shael
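A "cross-phone" check, where any phone field of one record may match any phone field of the other, could be sketched in plain Python as below. The field names `phone1` to `phone3` come from the excerpt; the helper names are hypothetical, the normalization (digits only, no country-code handling) is an assumption, and how such a check would be wired into the library is not shown here.

```python
import re

# Field names taken from the question excerpt.
PHONE_FIELDS = ("phone1", "phone2", "phone3")


def normalize_phone(raw):
    """Keep digits only; return None for missing or empty values."""
    if not raw:
        return None
    digits = re.sub(r"\D", "", str(raw))
    return digits or None


def phone_set(record):
    """Collect the distinct normalized phone numbers found on a record."""
    normalized = (normalize_phone(record.get(field)) for field in PHONE_FIELDS)
    return {phone for phone in normalized if phone}


def shares_phone(record_a, record_b):
    """True if any phone on one record equals any phone on the other."""
    return bool(phone_set(record_a) & phone_set(record_b))
```

For example, a record with phone1 "(555) 010-2000" and a record with phone3 "555-010-2000" would be treated as sharing a phone, regardless of which column holds the number.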
0
votes
1 answer

Is there a performance difference between `dedupe.match(generator=True)` and `dedupe.matchBlocks()` for large datasets?

I'm preparing to run dedupe on a fairly large dataset (400,000 rows) with Python. In the documentation for the DedupeMatching class, there are both the match and matchBlocks functions. For match, the docs suggest using it only on small to moderately…