Get the highest occurrence between 2 columns

Question

I have 2 columns, called decision1 and decision2.

I want to compare them in order to get the highest occurrence between the 2, so I either get the highest occurrence in decision1 or decision2, according to the greatest one. So far my attempts led to this, but with no success since I just get the highest occurrence in EACH column and not combined

 # weightage option
if args['weightage'] == "yes":
    attr1 = data['decision'].value_counts().idxmax  #highest occurrence in decision
    attr2 = data['decision2'].value_counts().idxmax #highest occurrence in decision2
    heaviest_attribute = data.groupby(['decision','decision2']).size()

Ideally I would just need to use some kind of max() function between attr1 and attr2 but I don't know how to handle this.

For example, given this table

I want to compare decision1 and decision2 columns as if they were one column, and the expected output, in this case, would be 'Yes', because it's the most recurrent value.

Do you think `data.groupby(['decision','decision2']).size().idxmax` ? — jezrael, Dec 10 '20 at 08:06
Sample input output is much appreciated to better understand your problem — Srivatsav Raghu, Dec 10 '20 at 08:13
Thanks for your feedbacks, I added a sample table as reference — Andre Retro Pie, Dec 10 '20 at 08:14

Srivatsav Raghu · Answer 1 · 2020-12-10T08:50:24.790

1

Here's a simple solution.

Its best to convert the content into a list and finding max occurance in list is simple.

import pandas as pd
data = pd.DataFrame({'decision': ['yes', 'maybe', 'yes', 'maybe', 'yes'], 
               'decision 2': ['No', 'No', 'Perhaps', 'Perhaps', 'unsure']
            })
a = list(data['decision'])+list(data['decision 2'])
a = max(set(a), key=a.count)
print(a)

Output:

yes

edited Dec 10 '20 at 08:50

answered Dec 10 '20 at 08:34

Srivatsav Raghu

399
4
11

2

This is returning "yes" because the `max` function takes the last alphabetical letters as the greatest value. Try `max(['yes','yes','maybe', 'z'])` and you'll see the issue. – Cainã Max Couto-Silva Dec 10 '20 at 08:45
1

Yes, you are right. I've updated my code now. Thanks for pointing out. – Srivatsav Raghu Dec 10 '20 at 08:51

jezrael · Accepted Answer · 2020-12-10T09:22:36.070

1

Use DataFrame.melt with Series.mode and select first value by position with Series.iat:

a = df[['decision','decision 2']].melt()['value'].mode().iat[0]

Or reshape by DataFrame.stack:

a = df[['decision','decision 2']].stack().mode().iat[0]

print (a)
Yes

Detail:

print (df[['decision','decision 2']].melt()['value'])
0        Yes
1      Maybe
2        Yes
3      Maybe
4        Yes
5         No
6         No
7    Perhaps
8    Perhaps
9     unsure
Name: value, dtype: object

print (df[['decision','decision 2']].stack())
0  decision          Yes
   decision 2         No
1  decision        Maybe
   decision 2         No
2  decision          Yes
   decision 2    Perhaps
3  decision        Maybe
   decision 2    Perhaps
4  decision          Yes
   decision 2     unsure
dtype: object

EDIT:

s = df.eq(a).any()

col = s.index[s][0]
print (col)
decision

edited Dec 10 '20 at 09:22

answered Dec 10 '20 at 08:38

jezrael

822,522
95
1,334
1,252

Thanks, this works like a charm (especially `a = df[['decision','decision 2']].stack().mode().iat[0]`). One bonus question though: by getting this value, assuming each column has its own values so in our case 'Yes' will always be in decision, and never in decision2, can we retrieve the column name by using the value? – Andre Retro Pie Dec 10 '20 at 09:18
@AndreRetroPie - I think I understand, answer was edited. – jezrael Dec 10 '20 at 09:22
Thanks sir, you're my hero – Andre Retro Pie Dec 10 '20 at 09:33

yxteh · Answer 3 · 2020-12-10T08:31:39.893

0

Probably there exist a much more elegant solution...

df = pd.DataFrame({'decision': ['Yes', 'Maybe', 'Yes', 'Maybe', 'Yes'], 
                   'decision 2': ['No', 'No', 'Perhaps', 'Perhaps', 'unsure']})

d1 = dict(df.groupby('decision').count().loc[:, 'decision 2'])
d2 = dict(df.groupby('decision 2').count().loc[:, 'decision'])

d1.update(d2)

max(d1, key=d1.get)

edited Dec 10 '20 at 08:31

answered Dec 10 '20 at 08:25

yxteh

126
1
8

Get the highest occurrence between 2 columns

3 Answers3