How to fuzzy match names in a free text field using Python?

Question

I have two datasets: one contains names and the other contains free text. There are plenty of resources on matching similar text regardless of word order using fuzzy matching or TF-IDF, e.g. Jayda Silva Todd, Todd Jayda Silva, Silva Todd Jayda. However, I am unsure how to apply these techniques to a free-text field in order to extract any matching name.

Names DataFrame:

S/N	Name
1	Jayda Silva Todd
2	Kerys Felix
3	Beauden Ventura
4	Giorgia Fleming

Free Text DataFrame:

Reference No	Name
1	Lorem Ipsum is simply dummy text Felix Kerys of the printing and typesetting industry.
2	Contrary to popular belief, Lorem Ipsum is not simply random text.
3	This text will return results as well although there's a slight spelling error Jayda Silva Lorem ipsum dolor sit amet, consectetur adipiscing elit
4	It is a long established fact that a reader will be distracted by the readable content Beauden, Ventur of a page when looking at its layout.

Expected Output (on Free Text DataFrame):

Reference No	Name	Expected Result (from Names Dataframe)
1	Lorem Ipsum is simply dummy text Felix Kerys of the printing and typesetting industry.	Kerys Felix
2	Contrary to popular belief, Lorem Ipsum is not simply random text.	"empty"
3	This text will return results as well although there's a slight spelling error Jayda Silva Lorem ipsum dolor sit amet, consectetur adipiscing elit	Jayda Silva Todd
4	It is a long established fact that a reader will be distracted by the readable content Beauden, Ventur of a page when looking at its layout.	Beauden Ventura

Fuzzy matching is not deterministic, so how did you decide that your expected result is correct? Did you try a library such as [fuzzywuzzy](https://pypi.org/project/fuzzywuzzy/)? — Lei Yang, Jan 13 '22 at 07:36
It can be based on the score returned from fuzzywuzzy, let's say 90%. I tried fuzzywuzzy with a name/address match but not with a list of names and a chunk of text. — Gabriel Choo, Jan 13 '22 at 07:43
I suggest pasting your sample code with fuzzywuzzy. The data itself is less relevant, so you don't need to include so much real-world data. — Lei Yang, Jan 13 '22 at 07:45

Laurent · Answer 1 · 2022-01-17T16:38:07.033

Since you know what you are looking for (the dataframe of names), you could try an approach without fuzzy matching.

import pandas as pd

persons = pd.DataFrame(
    {
        "S/N": [1, 2, 3, 4],
        "Names": [
            "Jayda Silva Todd",
            "Kerys Felix",
            "Beauden Ventura",
            "Giorgia Fleming",
        ],
    }
)


free_text = pd.DataFrame(
    {
        "No": {0: 1, 1: 2, 2: 3, 3: 4},
        "Text": {
            0: "Lorem Ipsum is simply dummy text Felix Kerys of the printing and typesetting industry.",
            1: "Contrary to popular belief, Lorem Ipsum is not simply random text.",
            2: "This text will return results as well although there's a slight spelling error Jayda Silva Lorem ipsum dolor sit amet, consectetur adipiscing elit",
            3: "It is a long established fact that a reader will be distracted by the readable content Beauden, Ventur of a page when looking at its layout.",
        },
    }
)

First, define what you are looking for:

names_to_match = {
    part: complete_name
    for complete_name in persons["Names"]
    for part in complete_name.split(" ")
}
# An example key: value pair is "Jayda": "Jayda Silva Todd"
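
Note that this dictionary keeps only one full name per part: if two people shared a part (say, two different "Silva"s), the later name would silently overwrite the earlier one. If that matters for your data, one possible alternative is to keep every candidate per part. The sketch below is only an illustration; the extra name "Maria Silva" and the variable `parts_to_names` are made up and do not appear in the question.

```python
from collections import defaultdict

# "Maria Silva" is a hypothetical extra entry, added only to show a collision
names = ["Jayda Silva Todd", "Kerys Felix", "Maria Silva"]

# Map each name part to the list of all full names that contain it
parts_to_names = defaultdict(list)
for complete_name in names:
    for part in complete_name.split():
        parts_to_names[part].append(complete_name)

print(parts_to_names["Silva"])
```

With this structure, a lookup returns a list of candidates, so the caller has to decide how to handle ambiguous parts.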

Then, define a helper function for comparison:

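def match(names_to_match, text):
    # Return the full name belonging to the first name part found in the text.
    # Note: `in` is a case-sensitive substring test.
    for part, complete_name in names_to_match.items():
        if part in text:
            return complete_name
    return None  # no name part was found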
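
One caveat: `part in text` is a substring test, so a short name part such as "Todd" would also match inside a longer word like "Toddler". If whole-word, case-insensitive matching is preferred, a stricter variant could look roughly like this. It is a sketch, not part of the solution above; the function name `match_whole_words` and the sample sentence are hypothetical.

```python
import re


def match_whole_words(names_to_match, text):
    # Split the text into lowercase word tokens, so that punctuation
    # (as in "Beauden,") does not get in the way of a match.
    words = set(re.findall(r"\w+", text.lower()))
    for part, complete_name in names_to_match.items():
        if part.lower() in words:
            return complete_name
    return None  # no name part appears as a whole word


sample_names = {
    "Todd": "Jayda Silva Todd",
    "Kerys": "Kerys Felix",
    "Felix": "Kerys Felix",
}
# "toddler" contains "todd" but is not a whole-word match
print(match_whole_words(sample_names, "A toddler walked past Felix Kerys."))
```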

Finally, apply the function with Pandas map:

free_text["Result"] = free_text["Text"].map(lambda x: match(names_to_match, x) or "")

print(free_text)
# Output
   No                                               Text            Result
0   1  Lorem Ipsum is simply dummy text Felix Kerys o...       Kerys Felix
1   2  Contrary to popular belief, Lorem Ipsum is not...
2   3  This text will return results as well although...  Jayda Silva Todd
3   4  It is a long established fact that a reader wi...   Beauden Ventura
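
Keep in mind that row 4 matches only because "Beauden" is spelled correctly; the misspelled "Ventur" on its own would not be found by an exact test. If you need some tolerance for typos, one option is to compare each word of the text with each name part using `difflib` from the standard library. The following is a sketch under assumptions: the function name `fuzzy_match`, the 0.85 cutoff and the rule "return the name with the most matched parts" are illustrative choices, not requirements taken from the question.

```python
import re
from difflib import SequenceMatcher


def fuzzy_match(names, text, cutoff=0.85):
    # Tokenize the text into lowercase words, ignoring punctuation
    words = re.findall(r"\w+", text.lower())
    best_name, best_hits = "", 0
    for name in names:
        parts = name.lower().split()
        # Count how many parts of this name are close to some word in the text
        hits = sum(
            1
            for part in parts
            if any(SequenceMatcher(None, part, w).ratio() >= cutoff for w in words)
        )
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name  # empty string when nothing is similar enough


names = ["Jayda Silva Todd", "Kerys Felix", "Beauden Ventura", "Giorgia Fleming"]
print(fuzzy_match(names, "the readable content Beauden, Ventur of a page"))
# prints: Beauden Ventura
```

It can be applied in the same way as before, e.g. `free_text["Text"].map(lambda x: fuzzy_match(persons["Names"], x))`. The cutoff is a trade-off: a lower value tolerates more typos but increases the risk of false matches against ordinary words, so it is worth tuning on your own data.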
