I have the following pandas DataFrame, with only three columns:
import pandas as pd
dict_example = {'col1':['A', 'A', 'A', 'A', 'A'],
'col2':['A', 'B', 'A', 'B', 'A'], 'col3':['A', 'A', 'A', 'C', 'B']}
df = pd.DataFrame(dict_example)
print(df)
  col1 col2 col3
0    A    A    A
1    A    B    A
2    A    A    A
3    A    B    C
4    A    A    B
For the rows with differing elements, I'm trying to write a function that returns the column names of the "minority" elements.
As an example, in row 1 there are two A's and one B. Since there is only one B, I consider it the "minority". If all elements are the same, there's naturally no minority (or majority). However, if every column has a different value, I consider all of the columns to be minorities.
Here is what I have in mind:
  col1 col2 col3  min
0    A    A    A  []
1    A    B    A  ['col2']
2    A    A    A  []
3    A    B    C  ['col1', 'col2', 'col3']
4    A    A    B  ['col3']
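To make the rule concrete, here is a minimal row-by-row sketch that reproduces the table above. The helper name `minority_columns` is hypothetical, and it assumes "minority" means the values with the lowest count in a row, with a special case returning an empty list when every value is identical. It loops over rows in Python, so it illustrates the logic rather than an efficient solution.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'A'],
    'col2': ['A', 'B', 'A', 'B', 'A'],
    'col3': ['A', 'A', 'A', 'C', 'B'],
})


def minority_columns(row):
    """Return the column names holding the least frequent value(s) in a row.

    Assumption: a row whose values are all identical has no minority.
    """
    counts = Counter(row)            # value -> number of occurrences in the row
    if len(counts) == 1:             # every column holds the same value
        return []
    fewest = min(counts.values())    # the lowest occurrence count in this row
    return [col for col, value in row.items() if counts[value] == fewest]


result = df.copy()
# Plain Python loop over rows: easy to follow, but not vectorized.
result['min'] = [minority_columns(row) for _, row in df.iterrows()]
print(result)
```

With more than three columns this lowest-count rule is only one possible reading of "minority"; another would be "every column that does not hold the single most frequent value", and the two can disagree.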
I'm stumped on how to calculate this in a computationally efficient way.
Finding the most frequent item appears straightforward, either with pandas.DataFrame.mode()
or, for a plain list, as follows:
lst = ['A', 'B', 'A']
max(lst,key=lst.count)
But I'm not sure how to find the least frequently occurring items.
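For a plain list, one possible sketch uses `collections.Counter` from the standard library. Note that `min(lst, key=lst.count)` would return only a single item, whereas this collects every item tied for the lowest count. The variable names `fewest` and `least_common` are just illustrative.

```python
from collections import Counter

lst = ['A', 'B', 'A']

counts = Counter(lst)              # Counter({'A': 2, 'B': 1})
fewest = min(counts.values())      # lowest occurrence count in the list

# Keep every distinct item whose count ties for the minimum.
least_common = [item for item, n in counts.items() if n == fewest]
print(least_common)  # ['B']
```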