4

I am quite new to Pandas and I am trying to do the following thing:

I have two dataframes comms and arts that look like this (except for the fact they are longer ad with other columns)

comms:

ID    commScore           
10       5                
10       3                  
10      -1                 
11       0                
11       2              
12       9      
13      -2     
13      -1     
13       1      
13       4

arts:

ID    commNumber
10        3 
11        2    
12        1
13        4      

I need to group comms by their ID, and then save in arts (obviously in the correct ID line) the interquartile range (IQR) of the commScore distribution for each ID.

I've already tried using groupby, agg and map , but since my notions of pandas are quite limited, I just couldn't do what I was looking for.

Does anyone have a solution?

Thanks

Andrea

Sala
  • 480
  • 4
  • 19

2 Answers2

9

We can group the dataframe by ID and aggregate column commScore using the function iqr from scipy.stats to calculate inter quartile range, then map this calculated iqr range on the column ID of the arts dataframe

from scipy.stats import iqr

arts['IQR'] = arts['ID'].map(comms.groupby('ID')['commScore'].agg(iqr))

   ID  commNumber  IQR
0  10           3    3
1  11           2    1
2  12           1    0
3  13           4    3
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
3

The GroupBy object has a quantile method. You can calculate Q3 and Q1, and subtract them. Some renaming and joining follow it:

grouper = comms.groupby("ID")
q1, q3 = grouper.quantile(0.25), grouper.quantile(0.75)
iqr = q3 - q1
iqr = iqr.rename(columns={"commScore": "IQR"})

arts = arts.set_index("ID").join(iqr)

to get

>>> arts

    commNumber  IQR
ID
10           3  3.0
11           2  1.0
12           1  0.0
13           4  3.0

If there is more than one numeric column, then we go explicit about commScore as:

grouper = comms.groupby("ID").commScore
q1, q3 = grouper.quantile(0.25), grouper.quantile(0.75)
iqr = q3 - q1
iqr.name = "IQR"  # `iqr` will be a series since we selected 1 column,
                  #  so renaming is a bit different

arts = arts.set_index("ID").join(iqr)

which gives the same result.

If you don't want to call quantile 2 times, you can pass a list [0.75, 0.25] and then subtract them with agg. So instead of 2 lines above involving q1 and q3, we write

iqr = grouper.quantile([0.75, 0.25]).groupby("ID").agg(np.subtract.reduce)

and the rest is the same.

Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
  • Thanks for your answer. But how does python understand that I want to compute the quantile for the ```commScore```variable in line 2 of your code? – Sala Apr 22 '21 at 14:50
  • @Sala It applies to all the numeric columns left after grouping by ID. In this case, there is only `commScore`. – Mustafa Aydın Apr 22 '21 at 15:11
  • Actually I have more than one numeric column that I didn't report here – Sala Apr 22 '21 at 15:14
  • @Sala I edited to account for that if it helps. – Mustafa Aydın Apr 22 '21 at 15:21
  • Thanks, Mustafa. If I can ask, where did you learn about all this "advanced" use of pandas? – Sala Apr 22 '21 at 15:22
  • Oh, @Sala my knowledge is not advanced at all; if you are looking for advanced, the other answerer Shubham is the way to go :). But my two cents are: reading official documentation of Pandas, reading the top-rated and frequently visited answers here, trying to answer pandas questions here to practice, and of course learn from the great answers (e.g. Shubham's above answer; or any of his answers, really :)) should get us going. It takes time, of course. – Mustafa Aydın Apr 22 '21 at 15:29
  • 1
    Thaks for your detailed answer! From now on, I will try to do so. Cheers – Sala Apr 22 '21 at 15:31