I have two datasets: one contains names and the other contains free text. There are plenty of resources on matching similar text regardless of word order using fuzzy matching or TF-IDF, e.g. Jayda Silva Todd, Todd Jayda Silva, Silva Todd Jayda. However, I am unsure how to apply this technique to a free-text field in order to extract any name it contains.

Names DataFrame:

| S/N | Name |
|---|---|
| 1 | Jayda Silva Todd |
| 2 | Kerys Felix |
| 3 | Beauden Ventura |
| 4 | Giorgia Fleming |

Free Text DataFrame:

| Reference No | Name |
|---|---|
| 1 | Lorem Ipsum is simply dummy text Felix Kerys of the printing and typesetting industry. |
| 2 | Contrary to popular belief, Lorem Ipsum is not simply random text. |
| 3 | This text will return results as well although there's a slight spelling error Jayda Silva Lorem ipsum dolor sit amet, consectetur adipiscing elit |
| 4 | It is a long established fact that a reader will be distracted by the readable content Beauden, Ventur of a page when looking at its layout. |

Expected Output (on Free Text DataFrame):

| Reference No | Name | Expected Result (from Names DataFrame) |
|---|---|---|
| 1 | Lorem Ipsum is simply dummy text Felix Kerys of the printing and typesetting industry. | Kerys Felix |
| 2 | Contrary to popular belief, Lorem Ipsum is not simply random text. | "empty" |
| 3 | This text will return results as well although there's a slight spelling error Jayda Silva Lorem ipsum dolor sit amet, consectetur adipiscing elit | Jayda Silva Todd |
| 4 | It is a long established fact that a reader will be distracted by the readable content Beauden, Ventur of a page when looking at its layout. | Beauden Ventura |
  • Fuzzy matching is not deterministic, so how did you decide that your expected result is correct? Did you try a library such as [fuzzywuzzy](https://pypi.org/project/fuzzywuzzy/)? – Lei Yang Jan 13 '22 at 07:36
  • It can be based on the score returned from fuzzywuzzy, let's say 90%. I tried fuzzywuzzy with a name/address match but not with a list of names and a chunk of text. – Gabriel Choo Jan 13 '22 at 07:43
  • I suggest pasting your sample code with fuzzywuzzy; the data itself is less relevant, so you don't need to include so much real-world data. – Lei Yang Jan 13 '22 at 07:45
  • Show us what you've tried. – Danny Varod Jan 16 '22 at 11:47

1 Answer

Since you already know what you are looking for (the DataFrame of names), you could try a plain substring match on the name parts, without fuzzy logic.

import pandas as pd

persons = pd.DataFrame(
    {
        "S/N": [1, 2, 3, 4],
        "Names": [
            "Jayda Silva Todd",
            "Kerys Felix",
            "Beauden Ventura",
            "Giorgia Fleming",
        ],
    }
)


free_text = pd.DataFrame(
    {
        "No": {0: 1, 1: 2, 2: 3, 3: 4},
        "Text": {
            0: "Lorem Ipsum is simply dummy text Felix Kerys of the printing and typesetting industry.",
            1: "Contrary to popular belief, Lorem Ipsum is not simply random text.",
            2: "This text will return results as well although there's a slight spelling error Jayda Silva Lorem ipsum dolor sit amet, consectetur adipiscing elit",
            3: "It is a long established fact that a reader will be distracted by the readable content Beauden, Ventur of a page when looking at its layout.",
        },
    }
)

First, define what you are looking for:

names_to_match = {
    part: complete_name
    for complete_name in persons["Names"]
    for part in complete_name.split(" ")
}
# An example key:value pair is "Jayda": "Jayda Silva Todd"
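
One thing to keep in mind with a plain dict: if two people share a name part, the later name silently overwrites the earlier one. A minimal sketch of a variation that keeps every candidate, assuming a hypothetical list where "Felix" appears in two names (the `names` list and the `names_by_part` variable are illustrative, not part of the data above):

```python
from collections import defaultdict

# Hypothetical names: "Felix" is shared by two people
names = ["Jayda Silva Todd", "Kerys Felix", "Felix Ventura"]

# Map each name part to ALL complete names that contain it
names_by_part = defaultdict(list)
for complete_name in names:
    for part in complete_name.split():
        names_by_part[part].append(complete_name)

print(names_by_part["Felix"])  # ['Kerys Felix', 'Felix Ventura']
```

With the sample data in the question no part is shared, so the simpler dict is enough there.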

Then, define a helper function for comparison:

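def match(names_to_match, text):
    """Return the full name for the first name part found in text, else None."""
    for part, complete_name in names_to_match.items():
        # Case-sensitive substring test, so "Todd" would also match "Toddy"
        if part in text:
            return complete_name
    return None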
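
Note that `part in text` is a substring test, so a name part can also match inside a longer word. If that matters for your data, a possible variation is to compare against whole words only. This is just a sketch: `match_whole_words` is a hypothetical helper name, and the small dict and sentence are made up for illustration.

```python
import re


def match_whole_words(names_to_match, text):
    """Return the full name for the first part found as a whole word, else None."""
    # Split the text into word tokens so that "Todd" does not match "Toddy"
    words = set(re.findall(r"\w+", text))
    for part, complete_name in names_to_match.items():
        if part in words:
            return complete_name
    return None


# Illustrative data only
sample_names = {"Todd": "Jayda Silva Todd", "Kerys": "Kerys Felix"}
print(match_whole_words(sample_names, "Toddy met Kerys yesterday."))  # Kerys Felix
```

The substring version would have returned "Jayda Silva Todd" for that sentence, because "Todd" occurs inside "Toddy".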

Finally, apply the function with Pandas map:

free_text["Result"] = free_text["Text"].map(lambda x: match(names_to_match, x) or "")

print(free_text)
# Output
   No                                               Text            Result
0   1  Lorem Ipsum is simply dummy text Felix Kerys o...       Kerys Felix
1   2  Contrary to popular belief, Lorem Ipsum is not...
2   3  This text will return results as well although...  Jayda Silva Todd
3   4  It is a long established fact that a reader wi...   Beauden Ventura
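
If you do want some tolerance for spelling errors (such as "Ventur" in row 4) while still requiring more than one name part to agree, here is a rough sketch using only the standard library's `difflib`. It is not the fuzzywuzzy approach mentioned in the comments; the function name `fuzzy_match` and the `cutoff` and `min_parts` values are assumptions you would need to tune for your own data.

```python
import re
from difflib import get_close_matches


def fuzzy_match(names, text, cutoff=0.8, min_parts=2):
    """Return the name with the most parts approximately present in text, or ""."""
    words = re.findall(r"\w+", text.lower())
    best_name, best_hits = "", 0
    for name in names:
        parts = name.lower().split()
        # Count name parts that have a close enough word somewhere in the text
        hits = sum(
            1 for part in parts
            if get_close_matches(part, words, n=1, cutoff=cutoff)
        )
        # Single-word names only need their one part to match
        needed = min(min_parts, len(parts))
        if hits >= needed and hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


names = ["Jayda Silva Todd", "Kerys Felix", "Beauden Ventura", "Giorgia Fleming"]
texts = [
    "Lorem Ipsum is simply dummy text Felix Kerys of the printing and typesetting industry.",
    "Contrary to popular belief, Lorem Ipsum is not simply random text.",
    "This text will return results as well although there's a slight spelling error Jayda Silva Lorem ipsum dolor sit amet, consectetur adipiscing elit",
    "It is a long established fact that a reader will be distracted by the readable content Beauden, Ventur of a page when looking at its layout.",
]
results = [fuzzy_match(names, text) for text in texts]
print(results)
```

With a DataFrame this could be applied the same way as above, e.g. `free_text["Text"].map(lambda x: fuzzy_match(persons["Names"], x))`. Requiring two matching parts reduces false positives from common first names, at the cost of missing texts that mention only one part of a name.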
Laurent