Questions tagged [python-dedupe]

Questions about the dedupe Python library (a library for probabilistic deduplication and record linkage).

Dedupe is an open-source Python library for probabilistic deduplication, record linkage, and entity resolution.

67 questions
2
votes
1 answer

fuzzy duplicate check using python dedupe library error

I'm trying to use the Python dedupe library to perform a fuzzy duplicate check on my mock data, but I keep getting this error: {'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'}, 'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'}, 'Invoice Date':…
python_rok
  • 61
  • 1
  • 9
2
votes
1 answer

How to use pre labeled training data for Python Dedupe

I am using the Python Dedupe package for record linkage tasks, that is, matching company names in one data set to those in another. The Dedupe package lets the user label pairs to train a logistic regression model. However, it's a manual process and one needs to…
usct01
  • 838
  • 7
  • 18
2
votes
1 answer

Dedupe Python - "Records do not line up with data model"

I am stuck setting up Python and the dedupe library from dedupe.io to deduplicate a set of entries in a Postgres database. The error is "Records do not line up with data model", which should be easy to solve, but I just do not get why I get…
Pixelartist
  • 378
  • 5
  • 17
2
votes
1 answer

AttributeError: 'NoneType' object has no attribute 'learn_predicates'

I have information about known duplicates in the table learning, where entity_id is the same for duplicates. I want to train Dedupe by example, but I get an error. What am I doing wrong? con = psycopg2.connect(database=db_conf['NAME'], …
tatka
  • 301
  • 1
  • 3
  • 9
2
votes
1 answer

Python Record Linkage, Fuzzy Match and Deduplication

I have 3 datasets of customers with 7 columns: CustomerName, Address, Phone, StoreName, Mobile, Longitude, Latitude. Every dataset has 13000-18000 records. I am trying to fuzzy match for deduplication between them. My data set columns don't have the same…
Dr Sima
  • 135
  • 1
  • 12
2
votes
0 answers

Scaling Dedupe package functionality to large data using mysql DB

I have been trying for a while now to build a working example of the gazetteer/dedupe that scales to semi-large datasets connecting to SQL (using examples provided by the package) and have been unsuccessful. I would really appreciate it if anyone could…
2
votes
2 answers

Use Python dedupe library to return all matches against messy dataset

First, if you haven't seen the Dedupe library for Python: it's awesome. Much like TensorFlow, it's a great way to bring machine learning to the masses (like me). I'm trying to do record linkage of names against a single, large, messy data set. I'm…
PANDA Stack
  • 1,293
  • 2
  • 11
  • 30
2
votes
1 answer

Increase max_components variable in dedupe library

How can I increase the default value of the max_components variable? By default, max_components is set to 30000. I need to increase this limit because every time I run deduplication (using the same datasets) I get different results. I think that the total…
mjimcua
  • 2,781
  • 3
  • 27
  • 47
2
votes
0 answers

Python Postgresql dedupe consuming a lot of time. Can there be any optimization?

I am using the Postgres dedupe example code. For 10,000 rows, it takes 163 seconds. I found that most of the time is spent in this part: full_data = [] cluster_membership = collections.defaultdict(lambda : 'x') for cluster_id,…
Shubham Singh
  • 91
  • 2
  • 12
2
votes
3 answers

Python deduplicate records - dedupe

I want to use https://github.com/datamade/dedupe to deduplicate some records in Python. Looking at their examples data_d = {} for row in data: clean_row = [(k, preProcess(v)) for (k, v) in row.items()] row_id = int(row['id']) …
Georg Heiler
  • 16,916
  • 36
  • 162
  • 292
2
votes
2 answers

Setting explicit rules for matching records using Python Dedupe library

I'm using the Dedupe library to match person records to each other. My data includes name, date of birth, address, phone number and other personally identifying information. Here is my question: I always want to match two records with 100%…
blahblahblah
  • 2,299
  • 8
  • 45
  • 60
2
votes
1 answer

Python - Trouble with Dedupe: TypeError: unhashable type: 'numpy.ndarray'

I'm having trouble getting dedupe to run. I am trying to use this library to remove duplicates from a huge set of addresses. Here is my code: import collections import logging import optparse from numpy import nan import dedupe from unidecode…
Connor M
  • 182
  • 2
  • 12
1
vote
1 answer

pip install pylbfgs fails in a clean virtualenv

On a completely fresh virtualenv, installing pylbfgs fails with the error below. My goal is to install dedupe, but it depends on pylbfgs. I'm assuming it has something to do with the release of Cython 3.0.0 a few days ago, but even if I do pip…
Mathias Bak
  • 4,687
  • 4
  • 32
  • 42
1
vote
0 answers

Setting seed in python creating non consistent results on AWS

In my code I use the dedupe library to match records between 2 datasets. The underlying library uses random numbers from Python's random library and NumPy's random submodule, but it provides no way to set a seed for either. In our use case it is…
1
vote
1 answer

Pandas Dedupe: supplying self-created training data

I'm using pandas-dedupe to link a dataframe with misspellings to another with record-level info. Here is a much simplified example: df1 = pd.DataFrame({'a': ['cat', 'dog', 'frog', 'mouse', 'snake'], \ 'info': ['mammal', 'mammal', 'amphibian',…
svenkatesh
  • 1,152
  • 2
  • 10
  • 25