I'm using pandas-dedupe to link a dataframe with misspellings to another with record-level info. Here is a much simplified example:
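import pandas as pd

# Clean records with record-level info
df1 = pd.DataFrame({'a': ['cat', 'dog', 'frog', 'mouse', 'snake'],
                    'info': ['mammal', 'mammal', 'amphibian', 'mammal', 'reptile']})
# Records with misspellings that should be linked to df1
df2 = pd.DataFrame({'a': ['caat', 'mous', 'dog', 'xfrogg', 'snak', 'xyzgiraff']})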
I have separate training data in a CSV file, which looks like this:
df3 = pd.DataFrame({'orig': ['caat', 'mous', 'dog'], 'correct': ['cat', 'mouse', 'dog']})
How can I pass the labels in df3 as the training data in my call to pandas_dedupe.link_dataframes? I've tried reading the dedupe documentation, but I'm not sure how to format df3 so that I can pass it as training data.
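To make the question concrete, here is a minimal sketch of the kind of reshaping I have in mind. It rests on assumptions I have not confirmed: that dedupe-style training data is a JSON object with "match" and "distinct" lists of record pairs, that each record is a dict keyed by the field name used for linking ('a' here), and that the first record of each pair should come from the first dataframe (df1). The variable names training and training_json are only illustrative, and how or where pandas_dedupe would pick this up is exactly the part I'm unsure about.

```python
import json

import pandas as pd

# Labeled pairs: misspelled value and its correct form
df3 = pd.DataFrame({'orig': ['caat', 'mous', 'dog'],
                    'correct': ['cat', 'mouse', 'dog']})

# Assumed layout: each labeled pair is [record_from_df1, record_from_df2],
# and each record is a dict keyed by the linking field name 'a'.
training = {
    'match': [[{'a': correct}, {'a': orig}]
              for orig, correct in zip(df3['orig'], df3['correct'])],
    # No known non-matches in df3, so this list stays empty
    'distinct': [],
}

# Serialize to JSON text; where this would need to be saved for
# pandas_dedupe to read it is an open question, not a confirmed API.
training_json = json.dumps(training, indent=2)
```

Every row of df3 is a known match, so the "distinct" list is left empty in this sketch.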