
I'm a Python newbie and am getting a bit lost on how to transform my data.

Here's an example dataset:

import numpy as np
import pandas as pd

np.random.seed(123)  # seed NumPy's generator, which draws the random columns
df = pd.DataFrame({
    'pp': list(range(1, 11)),
    'age': list(np.random.randint(1, 9, 10) * 10),
    'gender': list(np.random.randint(1, 3, 10)),
    'yes/no': list(np.random.randint(0, 2, 10)),
})

>>> df
   pp  age  gender  yes/no
0   1   20       1       1
1   2   50       1       0
2   3   10       2       1
3   4   50       1       1
4   5   40       2       0
5   6   60       2       0
6   7   30       2       1
7   8   70       1       0
8   9   30       2       0
9  10   70       1       0

I want to create three new columns in my dataframe that represent the ratios between my different variables, namely:

  • ratio between gender 1 and 2 per yes/no category,
  • ratio between all existing age groups per yes/no category,
  • ratio between age and gender combination per yes/no category

For the first example I got something working like this:

df.groupby(["gender", "yes/no"]).size()/df.groupby(["yes/no"]).size()

But I actually want the output values as a new column, one value per pp. Does anyone know a neat way to do this?

Inkling

1 Answer


Try this:

(df.groupby(["gender", "yes/no"]).size()/df.groupby(["yes/no"]).size()).rename('ratio').reset_index()
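
To get one value per pp, one option is to merge this ratio table back onto df on the grouping columns. This is only a minimal sketch: it assumes the df values shown in the question's printout (hard-coded here so the snippet runs on its own), and the name df_with_ratio is purely illustrative.

```python
import pandas as pd

# Values copied from the printout in the question (assumption: this is the df in use)
df = pd.DataFrame({
    'pp': list(range(1, 11)),
    'age': [20, 50, 10, 50, 40, 60, 30, 70, 30, 70],
    'gender': [1, 1, 2, 1, 2, 2, 2, 1, 2, 1],
    'yes/no': [1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# Share of each gender within its yes/no category, as a small lookup table
ratio = (
    df.groupby(['gender', 'yes/no']).size()
    / df.groupby(['yes/no']).size()
).rename('ratio').reset_index()

# A left merge keeps every original row and attaches the matching ratio
df_with_ratio = df.merge(ratio, on=['gender', 'yes/no'], how='left')
print(df_with_ratio)
```

With how='left', the result should keep one row per pp in the original order, with the ratio repeated for rows that share the same gender and yes/no values.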


Hamzah
  • Thanks Phoenix, would you also know how to add the ratio per pp as a new column in the original df? – Inkling Apr 04 '22 at 08:38
  • @Inkling It works the same way as what I did: just change gender to pp, and rename('ratio') to rename('pp ratio') – Hamzah Apr 11 '22 at 10:28
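
For the follow-up about adding the ratios to the original df, a possible alternative is groupby(...).transform('size'), which returns one value per row, so the result can be assigned directly as a column. This is a hedged sketch rather than the answerer's method: it assumes the df values from the question's printout, and the helper add_ratio and the new column names are hypothetical.

```python
import pandas as pd

# Values copied from the printout in the question (assumption: this is the df in use)
df = pd.DataFrame({
    'pp': list(range(1, 11)),
    'age': [20, 50, 10, 50, 40, 60, 30, 70, 30, 70],
    'gender': [1, 1, 2, 1, 2, 2, 2, 1, 2, 1],
    'yes/no': [1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

def add_ratio(frame, keys, name):
    """Hypothetical helper: per-row share of the (keys + yes/no) group
    within that row's yes/no category."""
    # Size of the row's combined group, broadcast back to every row
    group_size = frame.groupby(keys + ['yes/no'])['pp'].transform('size')
    # Size of the row's yes/no category, also one value per row
    category_size = frame.groupby('yes/no')['pp'].transform('size')
    frame[name] = group_size / category_size
    return frame

df = add_ratio(df, ['gender'], 'gender_ratio')
df = add_ratio(df, ['age'], 'age_ratio')
df = add_ratio(df, ['age', 'gender'], 'age_gender_ratio')
print(df)
```

Because both transform results share the index of df, the division lines up row by row and no merge is needed.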