Dealing with multiple similar entities in a pandas DataFrame

Question

I have a dataframe with a 'Name' column. It contains multiple similar entries with some inconsistencies, and I want to merge them into one. I am a beginner in data analysis and came across the fuzzywuzzy module. I tried the following:

names = list(data['Name'].unique())

def replace_matches(df, column, matching_string, min_ratio = 90):

    strings = df[column].unique()
    for i in matching_string:
        matches = fuzzywuzzy.process.extract(i, strings, limit= 5, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
        close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
        matched_rows = df[column].isin(close_matches)
        df.loc[matched_rows, column] = matching_string
    return df

I call the function like this:

replace_matches(df = data, column = 'Name', matching_string = names)

but it raises ValueError: Must have equal len keys and value when setting with an iterable.

What is wrong here? Is there a more efficient way to find all the similar entries in a column?

How do you want to merge? Do you want a dataframe with only unique values in the 'Name' column? And then, what do you do with the rest of the columns? Sum them? Take the mean? Take the first entry only? — Jeroen, Aug 03 '18 at 11:32
Those are duplicate entries, some with an extra space or dot, so I want one row for each group of similar names/words. — S.Dasgupta, Aug 03 '18 at 11:38
okay, so you want to group for example 'Hello World' and 'HelloWorld' together, and 'Name' is your only column? — Jeroen, Aug 03 '18 at 11:39
Yes, but I have three other columns and I don't mind if they get collapsed. — S.Dasgupta, Aug 03 '18 at 11:40
It might be helpful if you included an example of your data and the desired outcome in your question, so that the given solutions will be tailored to your problem. — Jeroen, Aug 03 '18 at 11:48
If the only differences between similar words in the 'Name' column are spaces and dots, you can just replace these characters with nothing and the words will then match, like: `df.Name.str.replace(r'\.|\s', '', regex=True)` — Ben.T, Aug 03 '18 at 14:22
And isn't the error you get coming from `df.loc[matched_rows, column] = matching_string`? I think what you want here is `df.loc[matched_rows, column] = i`, no? — Ben.T, Aug 03 '18 at 14:30
Thanks for pointing that out! It works for small data, but for large data it takes a lot of time. — S.Dasgupta, Aug 06 '18 at 06:07
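
For reference, the ValueError comes from assigning `matching_string` (the whole list of names) to the selected rows instead of the single name `i`, as noted in the comments. The slowness on large data is plausibly because the frame is scanned and rewritten once per name. Below is a minimal sketch, not a definitive solution, that strips dots and extra spaces first, fuzzy-matches only the distinct cleaned names, and applies the result with a single `map`. The sample data, the helper names `token_sort_score` and `build_name_map`, and the use of the standard library's `difflib` as a stand-in for `fuzzywuzzy.fuzz.token_sort_ratio` are all assumptions made for illustration; `difflib` scores will not necessarily equal fuzzywuzzy's.

```python
from difflib import SequenceMatcher

import pandas as pd


def token_sort_score(a, b):
    """Rough 0-100 similarity that ignores word order.

    Hypothetical stand-in for fuzzywuzzy.fuzz.token_sort_ratio.
    """
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return 100 * SequenceMatcher(None, a_sorted, b_sorted).ratio()


def build_name_map(names, min_ratio=90):
    """Map each distinct name to the first earlier name it closely matches."""
    canonical = []  # representative names chosen so far
    mapping = {}
    for name in names:
        match = next(
            (c for c in canonical if token_sort_score(name, c) >= min_ratio),
            None,
        )
        if match is None:
            # No close match yet, so this name becomes a representative.
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Made-up sample data, only for illustration.
data = pd.DataFrame({
    "Name": ["John Smith", "John Smith.", "John  Smith", "Jane Doe", "Doe Jane"],
    "Value": [1, 2, 3, 4, 5],
})

# Cheap pass first: drop dots and collapse runs of whitespace.
cleaned = (
    data["Name"]
    .str.replace(".", "", regex=False)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Fuzzy pass over the distinct cleaned values only, then one vectorised map.
name_map = build_name_map(cleaned.unique())
data["Name"] = cleaned.map(name_map)

# Keep the first row for each merged name.
deduped = data.drop_duplicates(subset="Name").reset_index(drop=True)
```

On this sample, `deduped` keeps one row each for 'John Smith' and 'Jane Doe'. Keeping the first row per name is an assumption based on the comment that the other columns may be collapsed. The matching step still compares each distinct name against the representatives found so far, so a column with many distinct names may remain slow.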
