Smart merging of columns of multiple choices with pandas' data frames

Question

The results of an opinion pool are stored in CSV format in which questions with multiple answers are coded with 0 and 1. For example, if the question "which is your favorite color?" is asked to 100 people, and possible answers are limited to red, blue, green, a result such as

colors = ['red', 'blue', 'green']
votes = [33, 57, 10]

will be stored by means of three columns: one for 'red' answer containing 33 1's and 67 0's, one for 'blue' containing 57 1's and 43 0's, and one for 'green' containing 10 1's and 90 0's.

I would like to transform my data frame so that these three colums are merged into a unique column containing occurences of 'red', 'blue' and 'green'.

Here is a simplified example of my problem:

from pandas import DataFrame

actual_pool = {'foo': [0, 1, 2, 3, 4, 5, 6, 7],
               'red': [1, 0, 0, 1, 0, 1, 0, 0],
               'blue': [0, 1, 0, 0, 0, 0, 1, 1],
               'green': [0, 0, 1, 0, 1, 0, 0, 0],
               'bar': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']}

actual_df = DataFrame(actual_pool,columns= ['foo', 'red', 'blue', 'green', 'bar'])

expected_pool = {'foo': [0, 1, 2, 3, 4, 5, 6, 7], 
                 'colors': ['red', 'blue', 'green', 'red', 'green', 'red', 'blue', 'blue'],
                 'bar': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']}

expected_df = DataFrame(expected_pool,columns= ['foo', 'colors', 'bar'])

print(actual_df)
print(expected_df)

The question is How can I get expected_df from actual_df with pandas?.

EDIT 1: add columns 'foo' and 'bar' because I am interested in merging a subsets of columns of data frames.

Sociopath · Accepted Answer · 2019-04-15T09:54:43.917

Use DataFrame.idmax to get column name with max value :

actual_pool = {'red': [1, 0, 0, 1, 0, 1, 0, 0],
               'blue': [0, 1, 0, 0, 0, 0, 1, 1],
               'green': [0, 0, 1, 0, 1, 0, 0, 0]}

actual_df = DataFrame(actual_pool,columns= ['red', 'blue', 'green'])

actual_df['colors'] = actual_df.idxmax(axis=1)

print(actual_df)

Output:

  red  blue  green   colors                                                                                                            
0    1     0      0    red                                                                                                            
1    0     1      0   blue                                                                                                            
2    0     0      1  green                                                                                                            
3    1     0      0    red                                                                                                            
4    0     0      1  green                                                                                                            
5    1     0      0    red                                                                                                            
6    0     1      0   blue                                                                                                            
7    0     1      0   blue

If you have extra columns

from pandas import DataFrame

actual_pool = {'red': [1, 0, 0, 1, 0, 1, 0, 0],
               'blue': [0, 1, 0, 0, 0, 0, 1, 1],
               'green': [0, 0, 1, 0, 1, 0, 0, 0],
                'pink': [12,0,11,2,0,90,0,12]
}

actual_df = DataFrame(actual_pool,columns= ['red', 'blue', 'green'])

actual_df['colors'] = actual_df[['red','blue','green']].idxmax(axis=1)

print(actual_df)

Wow, I am impressed! Does this solution can be applied to a subset of contiguous columns in a DataFrame? — Aleph, Apr 15 '19 at 09:47
It takes into account all the numeric columns in your data and returns the name of the column with max value. — Sociopath, Apr 15 '19 at 09:50
I fear my example is too simple. I will edit it to take this dimension of the problem into account. — Aleph, Apr 15 '19 at 09:51
It can be applied even if you have large dataset, just use the columns from which you want to find max. — Sociopath, Apr 15 '19 at 09:52

score 0 · Answer 2 · answered Apr 15 '19 at 10:10

0

try:

small_df = actual_df[['red', 'blue', 'green']]
small_df.eq(1) @ small_df.columns

will give output

0      red
1     blue
2    green
3      red
4    green
5      red
6     blue
7     blue
dtype: object

answered Apr 15 '19 at 10:10

Akhilesh_IN

1,217
1
13
19

Smart merging of columns of multiple choices with pandas' data frames

2 Answers2