1

I have data that includes a group (Area) and then also provides a name. I am trying to merge two data frames. One frame is much smaller and is the "mapping" data frame. It has one row for each Name within an Area. The other frame is much larger and is the "instances" data frame. It has many rows with variations of the names with the Area. I want the info from the mapping frame to be merged to the instances frame so that each instance has the information that the mapping frame provides accorded to it.

I have looked into the Fuzzy package but have not found a way to implement it within groups (Area column) or how to effectively use it with cells that hold multiple strings and trying to match based on that.

Below is a sample of similar data. The merge needs to pull the relevant info from the instances data frame in order to map it to the mapping data frame.

Mapping Dataframe

Area    Name
A   Apple  
A   Orange Strawberry 
A   Blackberry Rasberry 
B   Blackberry 
C   Kiwi  
C   Apple  

Instances Dataframe

Area    Locale
A   Apple Pear Tomato
A   Orange Potato Strawberry Zuccini
A   Blackberry Rasberry 
B   Blackberry Onion
B   Lettuce Blackberry Cucumber 
C   Kiwi Spinach Pineapple
C   Kiwi Potato 
C   Apple Cucumber 
C   Apple Potato 

The goal is to get a frame that looks like below.

Dataframe

Area    Locale                                  Name
A   Apple Pear Tomato                   Apple  
A   Orange Potato Strawberry Zuccini    Orange Strawberry 
A   Blackberry Rasberry                 Blackberry Rasberry 
B   Blackberry Onion                    Blackberry 
B   Lettuce Blackberry Cucumber.        Blackberry 
C   Kiwi Spinach Pineapple              Kiwi  
C   Kiwi Potato                         Kiwi  
C   Apple Cucumber                      Apple 
C   Apple Potato                        Apple 
Kskiaskd
  • 35
  • 5

1 Answers1

1

Using fuzz and pandas, you can get you your expected outcome.

1) start by gathering your libraries and data:

import pandas as pd
from fuzzywuzzy import fuzz
mapping = pd.read_excel('Book1.xlsx', sheet_name='mapping')
instance = pd.read_excel('Book1.xlsx', sheet_name='instance')

2) create list of unique values of your areas:

unique_area = instance['Area'].drop_duplicates(keep='first').values.tolist()

3) create an empty dataframe and run for each of the area the following code:

fuzzed_data = []
for i in unique_area:
    instanceunique = instance[instance['Area'] == i]
    unique_list = mapping[mapping['Area'] == i]['Name'].drop_duplicates(keep='first').values.tolist()
    instance_score = instanceunique[['Locale']]
    for i in unique_list:
        ratiofuzz = []
        for index, row in instance_score.iterrows():
                ratiofuzz.append(fuzz.ratio(row['Locale'], i))
        instance_score[i] = ratiofuzz
    scores = instance_score.drop(['Locale'], axis=1)
    instance_score['mapping'] = scores.idxmax(axis=1)
    instanceunique = pd.merge(instanceunique
                          , instance_score[['Locale', 'mapping']]
                          , how='left'
                          , on=['Locale'])
    fuzzed_data.append(instanceunique)

4) Concatenate the "fuzzed" data:

fuzzed_data = pd.concat(fuzzed_data, axis=0)

That's it! Let me know if you need more help. BR

RenauV
  • 363
  • 2
  • 11
  • Thank you for your response. After running the code I got the error “Length of values does not match length of index” – Kskiaskd Apr 25 '19 at 15:22
  • I have tried it and it worked on my side with the example you gave above. Maybe you are not using the same data? Also, please note that the last part (pd.concat) is outside of the loop. – RenauV Apr 25 '19 at 15:23