Questions tagged [python-dedupe]

Questions about the dedupe Python library (a library for probabilistic deduplication and record linkage).

Dedupe is an open-source Python library for probabilistic deduplication, record linkage, and entity resolution.

67 questions
1
vote
1 answer

How to solve the issue of malformed node or string error in pandas?

I have a dataframe and I am trying to remove the duplicate elements from each array in Column 2, with the resulting array stored in Column 3. Column1 Column 2 Column3 0 …
Cuckoo
  • 97
  • 9
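On the "malformed node or string" question above: that message is the `ValueError` Python's `ast.literal_eval` raises for input it cannot treat as a literal, which includes being handed an object that is already a list. The following is only a minimal sketch, assuming the cells hold either real lists or their string form; the sample values and helper names are hypothetical, since the excerpt does not show the actual data.

```python
import ast


def parse_cell(cell):
    """Return a list from a cell holding either a list or its string form."""
    if isinstance(cell, (list, tuple)):
        # Already a sequence: passing it to ast.literal_eval would raise
        # ValueError("malformed node or string ...").
        return list(cell)
    return list(ast.literal_eval(cell))


def unique_in_order(items):
    """Drop repeated elements while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Hypothetical values standing in for "Column 2".
column_2 = ["['a', 'b', 'a']", ["x", "x", "y"], "[1, 2, 2, 3]"]

# Values that would go into "Column 3".
column_3 = [unique_in_order(parse_cell(cell)) for cell in column_2]
```

With pandas, the same two helpers could be applied to the column through `Series.apply`.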
1
vote
0 answers

High CPU and memory utilization for python dedupe

I'm running a Python deduping application using the dedupe package. I've deployed it as an API using Flask and Gunicorn. I'm running the application on a Linux server with 128 GB of RAM and 40 cores. With a data size of 900,000, the…
Eswar
  • 1,201
  • 19
  • 45
1
vote
0 answers

Special character removal from dataframe in pandas while using deduplication

I am using deduplication on my data frame and I am getting a warning: import pandas as pd import numpy as np import pandas_dedupe scholar=pd.read_csv('Scholar.csv') Scholar_final=pandas_dedupe.dedupe_dataframe(scholar,['idScholar','ROW_ID']) Warning…
1
vote
0 answers

Dedupe library in python - problem with log file

I have some issues creating a log file using dedupe. This is the code I use to create the log file: import datetime import sys global log_log_file def writeErrorLogMessage(message): execution_log_line=str(datetime.datetime.now())+', -…
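For the log file question above, one alternative to a hand-written global file handle is Python's standard `logging` module. This is a sketch under assumptions, not the asker's code: the logger name, log file location, and message are hypothetical, and the truncated original function is not reproduced.

```python
import logging
import os
import tempfile

# Hypothetical log location; a real script would pick its own path.
log_path = os.path.join(tempfile.mkdtemp(), "dedupe_run.log")

logger = logging.getLogger("dedupe_run")  # hypothetical logger name
logger.setLevel(logging.INFO)

handler = logging.FileHandler(log_path)
# Timestamp, level, and message, separated by commas.
handler.setFormatter(logging.Formatter("%(asctime)s, %(levelname)s, %(message)s"))
logger.addHandler(handler)


def write_error_log_message(message):
    """Append one timestamped error line to the log file."""
    logger.error(message)


write_error_log_message("could not read input file")
handler.flush()

with open(log_path) as fh:
    log_text = fh.read()
```

The formatter supplies the timestamp, so the function does not need to build the line by string concatenation.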
1
vote
1 answer

How is a < comparison different from != in the following case?

I'm trying to understand this example at https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html. How is a < comparison different from != in the following case? read_cur.execute(""" select a.donor_id, …
Scapedee
  • 47
  • 7
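On the `<` versus `!=` question above: in a self-join, `a.id < b.id` returns each unordered pair once, while `a.id != b.id` returns every pair in both orders. The sketch below illustrates this with SQLite instead of MySQL; the table, the `block_key` column, and the rows are hypothetical and are not the query from the linked example, which is truncated in the excerpt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donors (donor_id INTEGER PRIMARY KEY, block_key TEXT)")
conn.executemany(
    "INSERT INTO donors VALUES (?, ?)",
    [(1, "smith"), (2, "smith"), (3, "smith")],
)


def candidate_pairs(operator):
    """Self-join donors sharing a block key, comparing IDs with `operator`."""
    sql = f"""
        SELECT a.donor_id, b.donor_id
        FROM donors a
        JOIN donors b ON a.block_key = b.block_key
        WHERE a.donor_id {operator} b.donor_id
        ORDER BY a.donor_id, b.donor_id
    """
    return conn.execute(sql).fetchall()


less_than = candidate_pairs("<")   # each unordered pair appears once
not_equal = candidate_pairs("!=")  # each pair appears in both orders
```

With three records in one block, `<` yields 3 pairs and `!=` yields 6, so `!=` doubles the comparison work without adding any new pair.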
1
vote
0 answers

Python Dedupe Library Compare List of Attributes

I'm attempting to link records between datasets with no common key to identify matches. For both datasets I may have none, one, or more addresses per record. How do you best set up the Python dedupe library to handle lists? I've pored over Google…
Douglas Plumley
  • 565
  • 5
  • 21
1
vote
2 answers

Getting a KeyError when trying to run De-dupe

Hi, I'm new to Python and I have no clue how to fix the following error: I have a data frame with around 2 million records & 20 columns of stores data. I am grouping the stores by State and trying to run dedupe_dataframe on each state after training it…
1
vote
1 answer

Python Dedupe.io problem reading data from SQL Server

I am trying to pull a large dataset from SQL Server and dedupe the information using Python's dedupe library. I am using pyodbc as the db connector but I cannot figure out how to get the data into the correct format using SQL Server. Works OK on…
WmSadler
  • 21
  • 4
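For the SQL Server question above, dedupe's example scripts pass data as a dictionary keyed by record ID whose values are field dictionaries. Assuming that layout is what is wanted, the sketch below builds it from any DB-API cursor. SQLite stands in for pyodbc so the snippet runs on its own; the table, columns, and the `rows_to_records` helper are hypothetical.

```python
import sqlite3


def rows_to_records(cursor, id_column):
    """Turn DB-API rows into {record_id: {column_name: value}}."""
    # cursor.description holds one tuple per column; the name comes first.
    columns = [description[0] for description in cursor.description]
    records = {}
    for row in cursor:
        record = dict(zip(columns, row))
        # Assumption: empty strings should be treated as missing (None).
        cleaned = {key: (value if value != "" else None)
                   for key, value in record.items()}
        records[cleaned[id_column]] = cleaned
    return records


# SQLite used here only as a runnable stand-in for a pyodbc connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (person_id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Ann Lee", "Austin"), (2, "Anne Lee", "")],
)

cursor = conn.execute("SELECT person_id, name, city FROM people")
data_d = rows_to_records(cursor, "person_id")
```

Because the helper relies only on `cursor.description` and row iteration, the same function should work with a pyodbc cursor, though that has not been verified against the asker's setup.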
1
vote
1 answer

How to erase pandas_dedupe.dedupe_dataframe training set?

I am working with the python pandas_dedupe package, specifically with pandas_dedupe.dedupe_dataframe. I have trained the dedupe_dataframe module via the interactive prompts. But now I need to retrain the dedupe_dataframe module. How can I erase the…
Stefan
  • 53
  • 1
  • 1
  • 5
1
vote
1 answer

deduper.blocker() function - cannot unpack non-iterable int object

I am trying to use the dedupe.io Python library, but for my needs I have to connect to an MS-SQL database. So I decided to first get the CSV example working (which I did), then I thought I would try to convert the pgSQL example to an MS-SQL version.…
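One common way the "cannot unpack non-iterable int object" message in the title above can arise is when code expecting `(record_id, record)` pairs is handed something that yields bare integer IDs, such as a dictionary iterated directly. Whether that is the cause in this question is an assumption, since the excerpt is truncated. The sketch reproduces the error in plain Python; `record_pairs` is a hypothetical stand-in, not dedupe's blocker.

```python
# Hypothetical records keyed by integer ID.
data = {1: {"name": "Alice"}, 2: {"name": "Bob"}}


def record_pairs(records):
    """Stand-in for a consumer that expects (record_id, record) pairs."""
    return [(record_id, record) for record_id, record in records]


try:
    # Iterating a dict yields only its keys (ints), so unpacking fails.
    record_pairs(data)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the (key, value) pairs works.
pairs_ok = record_pairs(data.items())
```

When rows come from a database cursor, the equivalent fix would be a generator that yields `(row_id, row)` tuples rather than the rows or IDs alone.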
1
vote
2 answers

"error: command 'cl.exe' failed: No such file or directory" - Python Dedupe Installation

I am trying to install the dedupe module and I am getting the error below: error: command 'cl.exe' failed: No such file or directory Failed building wheel for dedupe Failed building wheel for dedupe-hcluster Failed building wheel for…
user9431057
  • 1,203
  • 1
  • 14
  • 28
1
vote
1 answer

Clustering Components

When clustering, I receive the following warning: UserWarning: A component contained 77760 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 4.08109134074e-15. What does this mean? My original threshold…
Rtab
  • 123
  • 10
1
vote
1 answer

How to understand Dedupe library?

Two questions: How should the 'confidence score' be interpreted when a cluster has 3 rows and 3 confidence scores (0.98, 0.45, 0.45)? Where do these confidence scores come from: logistic regression, or somehow from hierarchical clustering? 10 000 of…
lubom
  • 329
  • 2
  • 13
1
vote
0 answers

Cluster New Record in Dedupe Clustered Table

I am using Python Dedupe for de-duplication of our MDM database. So far it works fine after sufficient training, and an entity map table is formed which shows the Cluster_ids, canonical name, and a score. I'm stuck and not sure for a new record…
min2bro
  • 4,509
  • 5
  • 29
  • 55
1
vote
0 answers

Dedupe - AttributeError: 'NoneType' object has no attribute 'indexAll'

I'm using the dedupe library and everything works fine until the training data is used for dedupe, but while calculating the threshold with the same data set it gives the following error: deduper.threshold(data_d, recall_weight=2) AttributeError: 'NoneType'…
min2bro
  • 4,509
  • 5
  • 29
  • 55