1

The results of an opinion pool are stored in CSV format in which questions with multiple answers are coded with 0 and 1. For example, if the question "which is your favorite color?" is asked to 100 people, and possible answers are limited to red, blue, green, a result such as

colors = ['red', 'blue', 'green']
votes = [33, 57, 10]

will be stored by means of three columns: one for 'red' answer containing 33 1's and 67 0's, one for 'blue' containing 57 1's and 43 0's, and one for 'green' containing 10 1's and 90 0's.

I would like to transform my data frame so that these three colums are merged into a unique column containing occurences of 'red', 'blue' and 'green'.

Here is a simplified example of my problem:

from pandas import DataFrame

actual_pool = {'foo': [0, 1, 2, 3, 4, 5, 6, 7],
               'red': [1, 0, 0, 1, 0, 1, 0, 0],
               'blue': [0, 1, 0, 0, 0, 0, 1, 1],
               'green': [0, 0, 1, 0, 1, 0, 0, 0],
               'bar': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']}

actual_df = DataFrame(actual_pool,columns= ['foo', 'red', 'blue', 'green', 'bar'])

expected_pool = {'foo': [0, 1, 2, 3, 4, 5, 6, 7], 
                 'colors': ['red', 'blue', 'green', 'red', 'green', 'red', 'blue', 'blue'],
                 'bar': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']}

expected_df = DataFrame(expected_pool,columns= ['foo', 'colors', 'bar'])

print(actual_df)
print(expected_df)

The question is How can I get expected_df from actual_df with pandas?.

EDIT 1: add columns 'foo' and 'bar' because I am interested in merging a subsets of columns of data frames.

Aleph
  • 1,343
  • 1
  • 12
  • 27

2 Answers2

1

Use DataFrame.idmax to get column name with max value :

actual_pool = {'red': [1, 0, 0, 1, 0, 1, 0, 0],
               'blue': [0, 1, 0, 0, 0, 0, 1, 1],
               'green': [0, 0, 1, 0, 1, 0, 0, 0]}

actual_df = DataFrame(actual_pool,columns= ['red', 'blue', 'green'])

actual_df['colors'] = actual_df.idxmax(axis=1)

print(actual_df)

Output:

  red  blue  green   colors                                                                                                            
0    1     0      0    red                                                                                                            
1    0     1      0   blue                                                                                                            
2    0     0      1  green                                                                                                            
3    1     0      0    red                                                                                                            
4    0     0      1  green                                                                                                            
5    1     0      0    red                                                                                                            
6    0     1      0   blue                                                                                                            
7    0     1      0   blue    

If you have extra columns

from pandas import DataFrame

actual_pool = {'red': [1, 0, 0, 1, 0, 1, 0, 0],
               'blue': [0, 1, 0, 0, 0, 0, 1, 1],
               'green': [0, 0, 1, 0, 1, 0, 0, 0],
                'pink': [12,0,11,2,0,90,0,12]
}

actual_df = DataFrame(actual_pool,columns= ['red', 'blue', 'green'])

actual_df['colors'] = actual_df[['red','blue','green']].idxmax(axis=1)

print(actual_df)
Sociopath
  • 13,068
  • 19
  • 47
  • 75
  • Wow, I am impressed! Does this solution can be applied to a subset of contiguous columns in a DataFrame? – Aleph Apr 15 '19 at 09:47
  • It takes into account all the numeric columns in your data and returns the name of the column with max value. – Sociopath Apr 15 '19 at 09:50
  • I fear my example is too simple. I will edit it to take this dimension of the problem into account. – Aleph Apr 15 '19 at 09:51
  • 1
    It can be applied even if you have large dataset, just use the columns from which you want to find max. – Sociopath Apr 15 '19 at 09:52
0

try:

small_df = actual_df[['red', 'blue', 'green']]
small_df.eq(1) @ small_df.columns

will give output

0      red
1     blue
2    green
3      red
4    green
5      red
6     blue
7     blue
dtype: object
Akhilesh_IN
  • 1,217
  • 1
  • 13
  • 19