I have an input and a master file in which I first tried doing an exact match using the fpd.fuzzy_merge
function, which seems to be working perfectly. Afterwards, I want to start applying the levenshtein method with different thresholds on the remaining rows from the input file, excluding the ones that actually had an exact match in the previously executed command.
Sample input:
Raw desc Clean desc Brand
CAULIFLOWER TRI COLOR 6 CT SAL cauliflower tri color sal SYSCO
!SYRUP BLACKBERRY 2LB syrup blackberry TYSON
PUREE BRKFST SAUSAGE LINK 24CT puree brkfst sausage link TYSON
Master file:
Master raw desc Master clean desc SKU
SYRUP & BLACKBERRY 50Z syrup blackberry 2356123
VEGETABLES MXD 6-10 GCHC vegetables mxd gchc 4412620
3 LB PAPER FOOD BOAT 500/CS paper food boat 4551210
After I am using the following code for an exact match, the output is the following which is the desired one. The challenge is to run the remaining of the sample input that didn't have an exact match through the Levenshtein method with different thresholds.
results1 = fpd.fuzzy_merge(
sample_df, master_df, left_on="Clean desc", right_on="Master clean desc"
)
Output 1):
Raw desc Clean desc Brand Master raw desc Master clean desc SKU
!SYRUP BLACKBERRY 2LB syrup blackberry TYSON SYRUP & BLACKBERRY 5OZ syrup blackberry 2356123
Afterwards, I use the following code using the Levenshtein method with a specific threshold, but no matter what parameter I use it only runs on the two columns I specified. I want to be able to run and return all the columns from both dataframes.
results = fpd.fuzzy_merge(
sample_df,
master_df,
left_on="Clean desc",
right_on="Master clean desc",
method="levenshtein",
threshold=0.85,
join="left-outer",
keep="match",
)
Any recs would be much appreciated!