0

Say I have a dataframe A as follows:

id        full_name
1           ABC Ltd.
2           NY BCD Ltd.
3           SH ZY Ltd.
4           Soho Company

also another dataframe B:

name                     id
 ABC
 NY BCD
 SH ZY
 Soho

If name in B is in full_name, then I I want to place the values of id from dataframe A to id of B. How can I do that in Python? Thanks.

ah bon
  • 9,293
  • 12
  • 65
  • 148
  • you can try with `p='({})'.format('\\b|\\b'.join(B.name))` and then `A=A.assign(match=A.full_name.str.extract(p,expand=False))` which gives you a match, then map or merge with B will work...?? – anky Jul 02 '19 at 08:02
  • Thanks, what `'\\b|\\b'` stands for here? – ah bon Jul 02 '19 at 09:39
  • joining every word with a word boundary `\\b` – anky Jul 02 '19 at 09:51

1 Answers1

1

How about this soultion with a fuzzy lookup?

import pandas as pd
import difflib 
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "full_name": ["ABC Ltd.", "NY BCD Ltd.", "SH ZY Ltd.", "Soho Company"]})

df2 = pd.DataFrame({"name": ["ABC", "SH ZY", "NY BCD", "Soho"]})

df2["full_name"] = df2["name"].map(lambda x: difflib.get_close_matches(x, df1["full_name"], cutoff=0.5)[0])

df2 = pd.merge(df2, df1, how="left", on="full_name")

THe lookup looks for close matches and returns the first value, the mathc is only given if at least 50% of the string matches.

The end_result for df2 would look something like this:

     name     full_name  id
0     ABC      ABC Ltd.   1
1   SH ZY    SH ZY Ltd.   3
2  NY BCD   NY BCD Ltd.   2
3    Soho  Soho Company   4
  • Thank you. Fuzzy lookup is good idea. But i got an error `IndexError: list index out of range` with my real data. Maybe because my actual is Chinese. Can we use `isin` at this case? – ah bon Jul 02 '19 at 09:39
  • Aat what part do you get the `index out of range` exception? The lookup `difflib.get_close_matches(x, df1["full_name"], cutoff=0.5)[0]` takes the first matching name, but when there is no match the list is empty an it will try to access the first element of an element which violates the Index. You can make sure that it only takes an element if there is one available other wise use a None value. – pfreiberger Jul 02 '19 at 09:57
  • `df2["name"].map(lambda x: difflib.get_close_matches(x, df1["full_name"], cutoff=0.5)[0])`, with this I got `index out of range` – ah bon Jul 02 '19 at 10:05
  • Maybe also we can use `fuzzywuzzy` and merge two dataframes, please check here: https://stackoverflow.com/questions/51934474/find-the-similarity-between-two-string-columns-of-a-dataframe – ah bon Jul 02 '19 at 10:23
  • Just run `df2["name"].map(lambda x: difflib.get_close_matches(x, df1["full_name"], cutoff=0.5))` and check what the out is. I am pretty sure at least one of the the sults returns an empty list `[]` – pfreiberger Jul 02 '19 at 10:55
  • Yes, there are many empty lists. At this case, what should I do? Maybe fuzzy match? – ah bon Jul 02 '19 at 10:56
  • 1
    `matches = df2["name"].map(lambda x: difflib.get_close_matches(x, df1["full_name"], cutoff=0.5))` `df2["full_name"] = [match[0] if match else None for match in matches]` Try to split up the query like this: you will get `None` values which won't have a match in the other frame but the query is more secure – pfreiberger Jul 02 '19 at 11:02