0

First, I appreciate any help that you can provide. I am getting a bit familiar with pandas but I still lack a lot of knowledge. The solution for this is probably easy, but I have been stuck for some time, so please help me if possible.

My current dataframe is way bigger with a ton of more options, but to make it simpler, let's say I have a dataframe that looks like this:

enter image description here

I will add a code to recreate the dataframe at the end.

Each subject answered three questions each day for two days, and I would like to know the number of correct or incorrect responses per day, per subject. I know I can achieve that by doing: n_resp=df.groupby(['Subject','Day','Response']).count()

However, in that final result, the Group column is replaced by the result from count().I would like to conserve the 'Group' column because I will use it to create a plot that will differentiate between the two. How can I achieve that?

I tried adding 'Group' to the groupby, but the result comes back empty: n_resp=df.groupby(['Group','Subject','Day','Response']).count() -> Dataframe with 4 indexed columns and 0 values.

I also tried to index the column that I don't need for groupby and then groupby, but it comes back empty and the indexed column is substituted by the groupby cols: n_resp=df.set_index('Group').groupby(['Subject','Day','Response']).count()

Finally, I tried the groupby options as_index=False and group_keys=False after adding all the cols to groupby but it didn't work either.

I have also tried to look for it online, but don't know exactly how to. I am probably misunderstanding the basics of how index and groupby work in DataFrames, so if that's it, please explain or direct me to a guide for it. I have read the groupby webpage but I didn't find any example that could help me.

Thank you for any help.

SOLUTION: Provided by @jezrael in comments and based on this response.

The suggested dupe question requires to groupby all the columns, even if not needed. This solution doesn't require that:

n_resp=df.copy() # To avoid modifying original DataFrame
n_resp['count'] = n_resp.groupby(['Subject','Day','Response'])['Response'].transform('count') # Create new column where count will be stored
n_resp.drop_duplicates(['Subject','Day','Response'], inplace=True) # Remove duplicated rows

This is how the result looks (might vary depending on randomized responses):

enter image description here

And here is the code to recreate the simplified DataFrame:

import pandas as pd
import random as rand

def rand_array_string(str_array,len_database):
    rand_array=['']*len_database
    for i in range(len_database):
        rand_array[i]=rand.choice(str_array)
    return rand_array
    
def repeat_strings(str_array,len_database):
    n_rep=int(len_database/len(str_array))
    rep_array=['']*len_database
    pos=0
    for i in range(n_rep):
        for string in str_array:
            rep_array[pos]=string
            pos+=1
    return rep_array

length=24

resp=rand_array_string(['Right','Wrong'], length)
subs=sorted(repeat_strings(['S1','S2','S3','S4'],length))
days=[str(n) for n in range(1,3)]
days=(sorted(days*3))*4
group=sorted(repeat_strings(['Young','Old'], length))

df = pd.DataFrame({'Group':group,
                   'Subject':subs,
                   'Day':days,
                   'Response':resp})
Zaida
  • 13
  • 2
  • Use `n_resp=df.groupby(['Group','Subject','Day','Response']).size().reset_index(name='count')` – jezrael Dec 07 '22 at 05:54
  • @jezrael Thank you for the response, but as I said, the example here is simplified. My dataframe has more columns that I would like to preserve, and I guess I could list them all, but would prefer not to. I found a response that could help with that here: https://stackoverflow.com/a/55754639/18255654 but I do not understand completely what some parts of it mean. – Zaida Dec 07 '22 at 06:04
  • So need `df['count'] = df.groupby(['Group','Subject','Day','Response'])['Response'].transform('count')` ? What is expected ouput from sample data? – jezrael Dec 07 '22 at 06:06
  • 1
    It would be this: https://i.stack.imgur.com/ECHft.png I got that by doing this: `n_resp=df.copy()`, `n_resp['Count']=n_resp.groupby(['Subject','Day','Response'],sort=False)['Subject'].transform('size')` and then `n_resp.drop_duplicates(['Subject','Day','Response'], inplace=True)`. Through that code, I don't need to write all of the columns that I want to keep, but I don't understand why the need to sort or what the ['Subject'] after groupby means (taking it off prompts an stdin error). – Zaida Dec 07 '22 at 06:20
  • 1
    @jezrael Your example works alright! I just had to add the drop duplicates, and that way I don't need to write all the columns to keep. Thank you very much! – Zaida Dec 07 '22 at 06:25
  • @jezrael Just an extra question if you have time to answer. Why is the `['Response']` in your suggestion necessary? I know it is needed as it doesn't work otherwise and it sorts the result depending on the column I put in, but I don't understand the reason why it needs to be there. Thank you for your time! – Zaida Dec 07 '22 at 06:42
  • Because `transform` if no specify column after groupby get counts by all columns, try `print (n_resp.groupby(['Subject','Day','Response'],sort=False).transform('size'))` - all columns exclude `Subject','Day','Response` are converted to new columns with same values – jezrael Dec 07 '22 at 06:45

0 Answers0