fuzzy duplicate check using python dedupe library error

Question

I'm trying to use the Python dedupe library to run a fuzzy duplicate check on my mock data, but I keep getting an error. Here is the data (as a dict), followed by the error:

{'Vendor': {0: 'ABC', 1: 'ABC', 2: 'TIM'},
 'Doc Date': {0: '5/12/2019', 1: '5/13/2019', 2: '4/15/2019'},
 'Invoice Date': {0: '5/10/2019', 1: '5/10/2019', 2: '4/10/2019'},
 'Invoice Ref Num': {0: 'ABCDE56.', 1: 'ABCDE56', 2: 'RTET5SDF'},
 'Invoice Amount': {0: '56', 1: '56', 2: '100'}}

IndexError: Cannot choose from an empty sequence

Here's the code that I'm using:

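import pandas as pd
import pandas_dedupe

# Load the mock data and inspect the column names
df = pd.read_csv("duptest.csv")
df.columns

# Run the fuzzy duplicate check on the selected columns
df = pandas_dedupe.dedupe_dataframe(df, ['Vendor', 'Invoice Ref Num', 'Invoice Amount'])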

Any idea what I'm doing wrong? Thanks.

score 2 · Answer 1 · answered Jul 05 '20 at 19:04

pandas-dedupe creates a sample of observations that you need to label. By default, the sample size is 30% of your dataframe. In your case, the dataframe has too few rows to start active learning.

If you set sample_size=1 as follows:

df = pandas_dedupe.dedupe_dataframe(df, ['Vendor', 'Invoice Ref Num', 'Invoice Amount'], sample_size=1)

you will be able to dedupe your data :)
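
To make the reasoning above concrete, here is a rough sketch of the arithmetic. It is not pandas-dedupe's actual code: the helper names `rows_sampled` and `can_pick` are hypothetical, and it assumes the fraction is converted to a row count by truncation. With 3 rows, a 30% sample works out to zero rows, and picking from an empty pool raises an IndexError, whereas sample_size=1 keeps all 3 rows.

```python
import random


def rows_sampled(n_rows, sample_size):
    # Hypothetical helper (assumption): the sample fraction is turned
    # into a whole number of rows by truncating toward zero.
    return int(n_rows * sample_size)


def can_pick(n_rows, sample_size):
    # Build the sampled pool, then try to draw one item from it.
    pool = list(range(n_rows))[:rows_sampled(n_rows, sample_size)]
    try:
        random.choice(pool)
        return True
    except IndexError:
        # random.choice raises IndexError when the sequence is empty.
        return False


# 3 rows, as in the mock data from the question
print(rows_sampled(3, 0.3), can_pick(3, 0.3))  # sample fraction from the answer
print(rows_sampled(3, 1), can_pick(3, 1))      # the suggested sample_size=1
```

On a realistically sized dataframe the default fraction leaves plenty of rows to label, so the override should only be needed for tiny test files like this one.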
