How to make a column with grouped value count and respect on its existing columns

Question

I have the following data table, and I want to get the count by putting some conditions on the existing columns, It'll be really great help if I can get the solution for the same.

Input:

   Key1    id1-age     id2-age     id3-age    id4-age   id5-age  id1-gender id2-gender   id3-gender    id4-gender    id5-gender
0   a          6          32          61         22       23         M       F               M               F           F
1   b         36          25          52         16       33         M       M               F               F           M
2   c         12          21          45         15       66         F       M               M               M           F

Problem Statement

A single key as multiple age of Individuals and Gender for that specific Key w.r.t. Age id, & I want make columns which will have counts of age groups for each row in python w.r.t. to its gender.

following output is expected:

Output Expected:

      Key1  id1-age id2-age id3-age id4-age id5-age  age(02-15)  age(16-21)  age(21-30)  age(31-40) age(41-50)   age(51-60)  age(61+)
0      a     6        32       61     22      23       1            0            2          1        0               0        1
1      b    36        25       52     16      33       0            1            1          2        0               1        0
2      c    12        21       45     15      66       2            1            0          0        1               0        1

I hope I'm able to give the proper explanation to my problem statement. waiting for the positive responses Thanks in Advance

score 2 · Answer 1 · answered Jul 13 '20 at 13:11

You can search through the columns and count the age groups for each row. Then the counted values can be stored in separate lists which will be added to the dataframe after traversing each row.

Here is my approach. It is not the shortest code and it can be improved.

import pandas as pd

df = pd.DataFrame([['a', 6, 32, 61, 22, 23],
                   ['b', 36, 25, 52, 16, 33],
                   ['c', 12, 21, 45, 15, 66],                   
                   ],
                  columns=['Key1', 'id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age'])

age_15 = []
age_21 = []
age_30 = []
age_40 = []
age_50 = []
age_60 = []
age_61 = []

for index, record in df.iterrows():
    search_columns = ['id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age']
    count_15 = 0
    count_21 = 0
    count_30 = 0
    count_40 = 0
    count_50 = 0
    count_60 = 0
    count_61 = 0
    for search_column in search_columns:
        age = record[search_column]
        if age>=2 and age <= 15:
            count_15 += 1
        elif age>=16 and age <= 21:
            count_21 += 1
        elif age>21 and age <= 30:
            count_30 += 1
        elif age>=31 and age <= 40:
            count_40 += 1
        elif age>=41 and age <= 50:
            count_50 += 1
        elif age>=51 and age <= 60:
            count_60 += 1
        elif age>=61:
            count_61 += 1                
    age_15.append(count_15)
    age_21.append(count_21)
    age_30.append(count_30)
    age_40.append(count_40)
    age_50.append(count_50)
    age_60.append(count_60)
    age_61.append(count_61)

df['age(02-15)'] = age_15
df['age(16-21)'] = age_21
df['age(21-30)'] = age_30
df['age(31-40)'] = age_40
df['age(41-50)'] = age_50
df['age(51-60)'] = age_60
df['age(61+)'] = age_61
print(df[['age(02-15)', 'age(16-21)', 'age(21-30)', 'age(31-40)', 'age(41-50)', 'age(51-60)', 'age(61+)']])

Output:

   age(02-15)  age(16-21)  age(21-30)  age(31-40)  age(41-50)  age(51-60)  age(61+)
0           1           0           2           1           0           0         1
1           0           1           1           2           0           1         0
2           2           1           0           0           1           0         1

can you please provide updated solution for the above question? @arsho — Anamta sayyed, Jul 14 '20 at 08:07
Your question is radically changed but you can still update my code according to the new requirements. You need to add the new values (`id1-gender`, ..., 'id5-gender') in `df` and update the logic in the loop. If you get stuck, I will assist. — arshovon, Jul 14 '20 at 17:25

vahdet · Answer 2 · 2020-07-13T13:59:49.000

There might be less verbose solutions, but applying conditional sum across your columns [1,5) and assigning them to new columns, something as follows should help:

import pandas as pd
df = pd.DataFrame({
  'Key1': ['a', 'b', 'c'],
  'id1-age': [6, 36, 12],
  'id2-age': [32, 25, 12],
  'id3-age': [61, 52, 45],
  'id4-age': [22, 16, 15],
  'id5-age': [23, 33, 66]
})

df['age(02-15)'] = ((df[df.columns[1:5]] >= 2) & (df[df.columns[1:5]] < 15)).sum(1)
df['age(16-21)'] = ((df[df.columns[1:5]] >= 16) & (df[df.columns[1:5]] < 21)).sum(1)
df['age(21-30)'] = ((df[df.columns[1:5]] >= 21) & (df[df.columns[1:5]] < 30)).sum(1)
df['age(31-40)'] = ((df[df.columns[1:5]] >= 31) & (df[df.columns[1:5]] < 40)).sum(1)
df['age(41-50)'] = ((df[df.columns[1:5]] >= 41) & (df[df.columns[1:5]] < 50)).sum(1)
df['age(51-60)'] = ((df[df.columns[1:5]] >= 51) & (df[df.columns[1:5]] < 60)).sum(1)
df['age(61+)'] = (df[df.columns[1:5]] >= 61).sum(1)

print(df)

And if you favor column names list instead of indices range, you can replace df.columns[1:5]s with ['id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age'] and even define it as a variable to avoid duplicating it over and over again. Then, it may become:

import pandas as pd
df = pd.DataFrame({
  'Key1': ['a', 'b', 'c'],
  'id1-age': [6, 36, 12],
  'id2-age': [32, 25, 12],
  'id3-age': [61, 52, 45],
  'id4-age': [22, 16, 15],
  'id5-age': [23, 33, 66]
})

range_cols = df[['id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age']]

df['age(02-15)'] = ((range_cols >= 2) & (range_cols < 15)).sum(1)
df['age(16-21)'] = ((range_cols >= 16) & (range_cols < 21)).sum(1)
df['age(21-30)'] = ((range_cols >= 21) & (range_cols < 30)).sum(1)
df['age(31-40)'] = ((range_cols >= 31) & (range_cols < 40)).sum(1)
df['age(41-50)'] = ((range_cols >= 41) & (range_cols < 50)).sum(1)
df['age(51-60)'] = ((range_cols >= 51) & (range_cols < 60)).sum(1)
df['age(61+)'] = (range_cols >= 61).sum(1)

print(df)

What can I do if want to mention selected column names instead column Index? @vahdet — Anamta sayyed, Jul 13 '20 at 13:27
@Anamtasayyed I added a second snippet for utilizing column names rather than indices. — vahdet, Jul 13 '20 at 14:00
can you please provide updated solution for the above question? @vahdet — Anamta sayyed, Jul 14 '20 at 08:06
It is not sustainable to incrementally change your question and bring new cases into play. Your original question already required a full code solution (and might be flagged to be too wide). So, you should try out your new case, and if you're stuck; ask a new question from scratch. — vahdet, Jul 14 '20 at 08:50

score 0 · Answer 3 · answered Jul 13 '20 at 13:14

0

you can use pandas.cut() given that your dataframe is called df , like this

df.apply(lambda r : pd.cut(r,[15,21,31,41,61,1000]).value_counts() , axis = 1)

and then merge the dataframes

answered Jul 13 '20 at 13:14

dantepawn

51
1
3

How to make a column with grouped value count and respect on its existing columns

3 Answers3