I'm using pandas-dedupe to link a dataframe containing misspellings to another dataframe with record-level info. Here is a much-simplified example:

import pandas as pd

df1 = pd.DataFrame({'a': ['cat', 'dog', 'frog', 'mouse', 'snake'],
                    'info': ['mammal', 'mammal', 'amphibian', 'mammal', 'reptile']})

df2 = pd.DataFrame({'a': ['caat', 'mous', 'dog', 'xfrogg', 'snak', 'xyzgiraff']})

I have separate training data in a CSV file, which looks like this:

df3 = pd.DataFrame({'orig': ['caat', 'mous', 'dog'], 'correct':['cat', 'mouse', 'dog']})

How can I pass the labels in df3 as the training data in my call to pandas_dedupe.link_dataframes? I've tried reading the dedupe documentation, but I'm not sure how to format df3 so that I can pass it as training data.

svenkatesh

1 Answer

My suggestion is to create labels using pandas-dedupe rather than passing your own labels into link_dataframes.
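For reference, labelling through pandas-dedupe means calling link_dataframes and answering its console prompts. Here is a minimal sketch of that call, assuming both dataframes share the column being matched ('a' in the question). The wrapper name link_with_interactive_labels is hypothetical, and the toy frames from the question are probably too small to train on in practice.

```python
def link_with_interactive_labels(left, right, fields):
    """Link two dataframes, labelling candidate pairs at the console prompt."""
    # Imported lazily so this module still loads where pandas-dedupe
    # is not installed.
    import pandas_dedupe

    # pandas-dedupe shows candidate pairs and asks whether each is a match;
    # those answers become the training labels.
    return pandas_dedupe.link_dataframes(left, right, fields)


# Usage (interactive, so it is left commented out here):
# linked = link_with_interactive_labels(df1, df2, ['a'])
```

On later runs with the same working directory, pandas-dedupe should pick up the files it saved instead of asking for labels again.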

pandas-dedupe saves the settings and the labels to a *_settings file and a *_training.json file, respectively. However, I would not recommend adding your own labels to the training file, since that could create a mismatch between the training file and the settings file.
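If you only want to check what a labelling session stored, you can read the training file as plain JSON without modifying it. This is a rough sketch, assuming the file has top-level "match" and "distinct" lists (the layout dedupe uses for labelled pairs) and that it is named link_dataframes_training.json, which I believe is the default for link_dataframes. The helper names load_labels and summarize_labels are hypothetical.

```python
import json
from pathlib import Path


def load_labels(training_path):
    """Read a dedupe-style training file and return its parsed JSON content."""
    return json.loads(Path(training_path).read_text())


def summarize_labels(labels):
    """Count the labelled pairs under the assumed 'match'/'distinct' keys."""
    return {
        "match": len(labels.get("match", [])),
        "distinct": len(labels.get("distinct", [])),
    }


# Usage (assumed default file name; adjust to whatever file was written):
# print(summarize_labels(load_labels("link_dataframes_training.json")))
```

This only reads the file, so it cannot put the training and settings files out of sync.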

iEriii