First, I appreciate any help that you can provide. I am getting a bit familiar with pandas but I still lack a lot of knowledge. The solution for this is probably easy, but I have been stuck for some time, so please help me if possible.
My current dataframe is way bigger with a ton of more options, but to make it simpler, let's say I have a dataframe that looks like this:
I will add a code to recreate the dataframe at the end.
Each subject answered three questions each day for two days, and I would like to know the number of correct or incorrect responses per day, per subject. I know I can achieve that by doing:
n_resp=df.groupby(['Subject','Day','Response']).count()
However, in that final result, the Group column is replaced by the result from count()
.I would like to conserve the 'Group' column because I will use it to create a plot that will differentiate between the two. How can I achieve that?
I tried adding 'Group' to the groupby, but the result comes back empty:
n_resp=df.groupby(['Group','Subject','Day','Response']).count()
-> Dataframe with 4 indexed columns and 0 values.
I also tried to index the column that I don't need for groupby and then groupby, but it comes back empty and the indexed column is substituted by the groupby cols:
n_resp=df.set_index('Group').groupby(['Subject','Day','Response']).count()
Finally, I tried the groupby
options as_index=False
and group_keys=False
after adding all the cols to groupby but it didn't work either.
I have also tried to look for it online, but don't know exactly how to. I am probably misunderstanding the basics of how index and groupby work in DataFrames, so if that's it, please explain or direct me to a guide for it. I have read the groupby webpage but I didn't find any example that could help me.
Thank you for any help.
SOLUTION: Provided by @jezrael in comments and based on this response.
The suggested dupe question requires to groupby
all the columns, even if not needed. This solution doesn't require that:
n_resp=df.copy() # To avoid modifying original DataFrame
n_resp['count'] = n_resp.groupby(['Subject','Day','Response'])['Response'].transform('count') # Create new column where count will be stored
n_resp.drop_duplicates(['Subject','Day','Response'], inplace=True) # Remove duplicated rows
This is how the result looks (might vary depending on randomized responses):
And here is the code to recreate the simplified DataFrame:
import pandas as pd
import random as rand
def rand_array_string(str_array,len_database):
rand_array=['']*len_database
for i in range(len_database):
rand_array[i]=rand.choice(str_array)
return rand_array
def repeat_strings(str_array,len_database):
n_rep=int(len_database/len(str_array))
rep_array=['']*len_database
pos=0
for i in range(n_rep):
for string in str_array:
rep_array[pos]=string
pos+=1
return rep_array
length=24
resp=rand_array_string(['Right','Wrong'], length)
subs=sorted(repeat_strings(['S1','S2','S3','S4'],length))
days=[str(n) for n in range(1,3)]
days=(sorted(days*3))*4
group=sorted(repeat_strings(['Young','Old'], length))
df = pd.DataFrame({'Group':group,
'Subject':subs,
'Day':days,
'Response':resp})