How to merge similar strings in a pandas column together using fuzzywuzzy

Question

I am analysing data from a survey. One of the questions asks which games respondents like the most. It is a free-text field, so users sometimes list one, two or three items and sometimes nothing. Because it is free text, answers that mean the same thing can differ: the user may have typed an extra space or character, or misspelled the name. How can I replace these similar values so that "don't know", "dk", "don't now" and "don't remember" are counted as the same string?

Here is a snippet of the data frame.

                 Q4_1       Q4_2       Q4_3         Q4_4 Q4_5 Q4_6 Q4_7 Q4_8  \
0           dark soul  valkiring        NaN          NaN  NaN  NaN  NaN  NaN   
1          Don't know        NaN        NaN          NaN  NaN  NaN  NaN  NaN   
2   World of Warcraft  Fallout 3  Fallout 4          NaN  NaN  NaN  NaN  NaN   
3          Don`t know        NaN        NaN          NaN  NaN  NaN  NaN  NaN   
4            warcraft        NaN        NaN          NaN  NaN  NaN  NaN  NaN   
5          don't know        NaN        NaN          NaN  NaN  NaN  NaN  NaN   
6  Mass Effect Series     Skyrim  Fallout 4  Tomb Raider  NaN  NaN  NaN  NaN   
7          dark souls        NaN        NaN          NaN  NaN  NaN  NaN  NaN   
8                none        NaN        NaN          NaN  NaN  NaN  NaN  NaN   
9         candy cruss        NaN        NaN          NaN  NaN  NaN  NaN  NaN   

  Q4_9 Q4_10  
0  NaN   NaN  
1  NaN   NaN  
2  NaN   NaN  
3  NaN   NaN  
4  NaN   NaN  
5  NaN   NaN  
6  NaN   NaN  
7  NaN   NaN  
8  NaN   NaN  
9  NaN   NaN

print(df_survey_Q4.head(10).stack())

0  Q4_1             dark soul
   Q4_2             valkiring
1  Q4_1            Don't know
2  Q4_1     World of Warcraft
   Q4_2             Fallout 3
   Q4_3             Fallout 4
3  Q4_1            Don`t know
4  Q4_1              warcraft
5  Q4_1            don't know
6  Q4_1    Mass Effect Series
   Q4_2                Skyrim
   Q4_3             Fallout 4
   Q4_4           Tomb Raider
7  Q4_1            dark souls
8  Q4_1                  none
9  Q4_1           candy cruss
dtype: object

print(df_survey_Q4.head(10).stack().value_counts())

Fallout 4             2
Skyrim                1
Fallout 3             1
World of Warcraft     1
valkiring             1
don't know            1
Tomb Raider           1
none                  1
warcraft              1
dark souls            1
Mass Effect Series    1
dark soul             1
Don`t know            1
Don't know            1
candy cruss           1
dtype: int64

So in this snippet, I would like Don't know, Don`t know and none to be grouped together as "Don't know" and counted as 3, instead of each one being counted as 1.
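A minimal sketch of the grouping described above, using `difflib` from the standard library as a stand-in for fuzzywuzzy so that it runs without extra packages. The names `CANONICAL`, `ALIASES`, `canonicalize` and the cutoff of 0.8 are hypothetical choices for illustration, not taken from the survey code. Answers such as "none" share almost no characters with "don't know", so a similarity score alone is unlikely to group them; the sketch assumes an explicit alias table for those cases.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical list of spellings we want to keep (assumption, not from the survey).
CANONICAL = ["don't know", "dark souls", "world of warcraft", "fallout 3", "fallout 4"]

# Hypothetical aliases for answers that mean the same thing but look nothing alike.
ALIASES = {"none": "don't know", "dk": "don't know"}


def canonicalize(answer, cutoff=0.8):
    """Map a free-text answer to its closest canonical spelling, if any."""
    # Normalise case, surrounding whitespace and the backtick used as an apostrophe.
    key = answer.strip().lower().replace("`", "'")
    if key in ALIASES:
        return ALIASES[key]
    # get_close_matches returns the best candidates first; keep only the top one.
    match = get_close_matches(key, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else key


answers = ["Don't know", "Don`t know", "none", "dark soul", "dark souls"]
counts = Counter(canonicalize(a) for a in answers)
print(counts)
```

With fuzzywuzzy installed, `process.extractOne` could play the role of `get_close_matches`, with scores on a 0-100 scale rather than 0-1. Either way the cutoff needs checking against real data: a short answer like "warcraft" scores too low here to be merged with "world of warcraft", while near-identical titles such as "fallout 3" and "fallout 4" score high enough that a misspelling of one could be merged into the other.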

Maybe my answer to a similar question helps you as a starting point: https://stackoverflow.com/a/62027240/3944322 - Stef, Jun 13 '20 at 13:36
If you're only concerned about the "don't know" answer, you could also simply replace answers that match a certain pattern (e.g. `(?:[dD]on.*know)|dk` in the simplest case) with `"don't know"`. - Stef, Jun 13 '20 at 13:42
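The pattern-based suggestion could be sketched roughly as follows. The function name `normalize` and the sample list are hypothetical. As an assumption that departs slightly from the pattern in the comment, the regex is compiled case-insensitively and `dk` is anchored so that it only matches when it is the whole answer.

```python
import re

# Variant of the commented pattern (assumption): case-insensitive, and "dk"
# must be the entire answer so it cannot match inside a longer word.
DONT_KNOW = re.compile(r"(?:don.*know)|^dk$", re.IGNORECASE)


def normalize(answer):
    """Return "don't know" for matching answers, else the stripped answer."""
    text = answer.strip()
    return "don't know" if DONT_KNOW.search(text) else text


answers = ["dark soul", "Don't know", "Don`t know", "don't know", "dk", "Skyrim"]
normalized = [normalize(a) for a in answers]
print(normalized)
```

On the stacked survey data, the same function could presumably be applied with `df_survey_Q4.stack().map(normalize).value_counts()`, since `stack()` drops the NaN cells by default and leaves only strings.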
