How to combine data columns with similar column names Pandas

Question

I have a data with many similar column names (basically mis-spell words), for example:

apple    grapes    apples    bana    apyles    grayes    graph    banana

Here, I want to combine the columns 'apple, apples, apyles', then 'grapes, grayes, graph', and 'bana, banana'. How can I do this?

*Edited for comment:

Q. What do you mean when you say “combine”. Can you include sample input and output?

Ans.

Input

apple    grapes    apples    bana    apyles    grayes    graph    banana
  1         2         3        4        5         6        7         8

Output

apple    grape    banana
  9       15         12

@Ukrainian-serge I just tried this out and i basically want any columns with 90% (fuzz.ratio()) to be combined/summed. But how can i apply this to each column? — Math Avengers, Jan 02 '21 at 06:14

score 2 · Accepted Answer · answered Jan 02 '21 at 07:16

With fuzzywuzzy you can try something as follows. Please note that the optimal fuzz.ratio that I could use to get this to work is 70:

import pandas as pd
from fuzzywuzzy import fuzz
l = []
correct = ['apple', 'grapes', 'banana']
cols = df.columns[df.columns.isin(correct)]
for col in cols:
    l.append([c for c in df.columns if fuzz.ratio(col,c) > 70])
df = df.T.reset_index()
for i in range(len(correct)):
    for j in l[i]:
        df['index'] = df['index'].replace(j, correct[i])
df = df.groupby('index').sum().T
df
Out[1]: 
index  apple  banana  grapes
0          9      12      15

LevB · Answer 2 · 2021-01-02T08:40:43.033

You don't need a cutoff for the fuzz score. Just use the highest.

import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({'fruit':['apple' ,'grapes',  'apples',  'bana',  'apyles',  'grayes',  'graph', 'banana'],'count':[1,2,3,4,5,6,7,8]})

choices = ['apples','grapes','bananas']

transl ={el2:max([(fuzz.ratio(el1,el2),el1) for el1 in choices])[1] for el2 in df['fruit'] }

df = df.replace({'fruit': transl}).groupby(['fruit'])['count'].sum()

print(df)

Output:

fruit
apples      9
bananas    12
grapes     15
Name: count, dtype: int64

How to combine data columns with similar column names Pandas

2 Answers2