Using pandas to identify which are the 5 point and 10 point scale survey questions in a dataframe

Question

I have a unique situation where my dataset contains multiple survey responses that were asked on two different scales primarily - a 5 point scale and then a 10 point scale and I have consolidated all of these responses in one dataframe. Now I would like to split and create a new column in my dataframe that can tell by looking into the responses and correspondinlgy identify it it's a 5 point scale or a 10 point scale question. For a response where there are no numbers mentioned such as 1-5 scale or 1-10 scale, the output should be blank. My dataframe looks like:

Question_Text
on a scale of 1 â€“ 10 how well would you rate the following statements.
on a scale of 1 to 10 how well would you rate the following statements.
on a scale of 1-10 how well would you rate the following statements.
on a scale of 1 10 how well would you rate the following statements.
on a scale of 1 â€“ 5 how well would you rate the following statements.
on a scale of 1 to 5 how well would you rate the following statements.
on a scale of 1-5 how well would you rate the following statements.
on a scale of 1 5 how well would you rate the following statements.
please tell us how ready you feel for this (0 - 6 not ready, 6-8 somewhat ready, and 9-10 ready)
how useful did you find the todayâ€™s webinar?

and what I would like to achieve looks like:

Question_Text                                                           Type_of_Question
on a scale of 1 â€“ 10 how well would you rate the following            10 point scale
on a scale of 1 to 10 how well would you rate the following             10 point scale
on a scale of 1 to 5 how well would you rate the following              5 point scale
please tell us how ready you feel for this (0 - 6 not ready)...         10 point scale
how useful did you find the todayâ€™s webinar?

...

Is there any possible way to achieve this? Can a pattern be identified using regex that can take care of different sorts of inputs as I have shown above?

mozway · Accepted Answer · 2023-01-11T05:14:07.947

Assuming "Question_Text" the column of interest and that you only have numbers that describe the scale, you could extractall and get the max number to compare it to 5. If greater, we assume it's a 10 points scale:

import numpy as np

num = (df['Question_Text']
       .str.extractall('(\d+)')[0]
       .astype(int)
       .groupby(level=0).max()
       .gt(5).reindex(df.index)
       )
    
df['Type_of_Question'] = np.select([num>5, num>0],
                                   ['10 point scale', '5 point scale'],
                                  np.nan)

Output:

                                                                                      Question_Text Type_of_Question
0                          on a scale of 1 â€“ 10 how well would you rate the following statements.   10 point scale
1                           on a scale of 1 to 10 how well would you rate the following statements.   10 point scale
2                              on a scale of 1-10 how well would you rate the following statements.   10 point scale
3                              on a scale of 1 10 how well would you rate the following statements.   10 point scale
4                           on a scale of 1 â€“ 5 how well would you rate the following statements.    5 point scale
5                            on a scale of 1 to 5 how well would you rate the following statements.    5 point scale
6                               on a scale of 1-5 how well would you rate the following statements.    5 point scale
7                               on a scale of 1 5 how well would you rate the following statements.    5 point scale
8  please tell us how ready you feel for this (0 - 6 not ready, 6-8 somewhat ready, and 9-10 ready)   10 point scale
9                                                    how useful did you find the todayâ€™s webinar?              nan

Thanks for sharing this. But if I am not wrong this will not work if my question_text column has a string that has no numbers in it. I have edited my input dataframe. Have a look at the last row. In such a scenario I am happy if nothing is returned. Can something be done? — Django0602, Jan 10 '23 at 23:24

score 0 · Answer 2 · answered Jan 10 '23 at 21:06

You can check to see if 5 or 10 exists in your column and create the new column based on that:

df["Type_of_Question"] = df.Question_Text.apply(lambda x: "10 point scale" if "10" in x else "5 point scale")
print(df)

                               

    Question_Text Type_of_Question
0  on a scale of 1 â€“ 10 how well would you rate...   10 point scale
1  on a scale of 1 to 10 how well would you rate ...   10 point scale
2  on a scale of 1-10 how well would you rate the...   10 point scale
3  on a scale of 1 10 how well would you rate the...   10 point scale
4  on a scale of 1 â€“ 5 how well would you rate ...    5 point scale
5  on a scale of 1 to 5 how well would you rate t...    5 point scale
6  on a scale of 1-5 how well would you rate the ...    5 point scale
7  on a scale of 1 5 how well would you rate the ...    5 point scale
8  please tell us how ready you feel for this (0 ...   10 point scale

Using pandas to identify which are the 5 point and 10 point scale survey questions in a dataframe

2 Answers2