-1

I have a data with many similar column names (basically mis-spell words), for example:

apple    grapes    apples    bana    apyles    grayes    graph    banana

Here, I want to combine the columns 'apple, apples, apyles', then 'grapes, grayes, graph', and 'bana, banana'. How can I do this?

*Edited for comment:

Q. What do you mean when you say “combine”. Can you include sample input and output?

Ans.

Input

apple    grapes    apples    bana    apyles    grayes    graph    banana
  1         2         3        4        5         6        7         8

Output

apple    grape    banana
  9       15         12 
David Erickson
  • 16,433
  • 2
  • 19
  • 35
Math Avengers
  • 762
  • 4
  • 15

2 Answers2

2

With fuzzywuzzy you can try something as follows. Please note that the optimal fuzz.ratio that I could use to get this to work is 70:

import pandas as pd
from fuzzywuzzy import fuzz
l = []
correct = ['apple', 'grapes', 'banana']
cols = df.columns[df.columns.isin(correct)]
for col in cols:
    l.append([c for c in df.columns if fuzz.ratio(col,c) > 70])
df = df.T.reset_index()
for i in range(len(correct)):
    for j in l[i]:
        df['index'] = df['index'].replace(j, correct[i])
df = df.groupby('index').sum().T
df
Out[1]: 
index  apple  banana  grapes
0          9      12      15
David Erickson
  • 16,433
  • 2
  • 19
  • 35
2

You don't need a cutoff for the fuzz score. Just use the highest.

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({'fruit':['apple' ,'grapes',  'apples',  'bana',  'apyles',  'grayes',  'graph', 'banana'],'count':[1,2,3,4,5,6,7,8]})

choices = ['apples','grapes','bananas']

transl ={el2:max([(fuzz.ratio(el1,el2),el1) for el1 in choices])[1] for el2 in df['fruit'] }

df = df.replace({'fruit': transl}).groupby(['fruit'])['count'].sum()

print(df)

Output:

fruit
apples      9
bananas    12
grapes     15
Name: count, dtype: int64
LevB
  • 925
  • 6
  • 10