
I have the following pandas DataFrame, with only three columns:

import pandas as pd
dict_example = {'col1':['A', 'A', 'A', 'A', 'A'], 
    'col2':['A', 'B', 'A', 'B', 'A'], 'col3':['A', 'A', 'A', 'C', 'B']}

df = pd.DataFrame(dict_example)

print(df)
  col1 col2 col3
0    A    A    A
1    A    B    A
2    A    A    A
3    A    B    C
4    A    A    B

For the rows with differing elements, I'm trying to write a function which will return the column names of the "minority" elements.

As an example, in row 1, there are 2 A's and 1 B. Given there is only one B, I consider this the "minority". If all elements are the same, there's naturally no minority (or majority). However, if each column has a different value, I consider these columns to be minorities.

Here is what I have in mind:

  col1 col2 col3   min
0    A    A    A   []
1    A    B    A   ['col2']
2    A    A    A   []
3    A    B    C   ['col1', 'col2', 'col3']
4    A    A    B   ['col3']

I'm stumped how to computationally efficiently calculate this.
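For reference, here is the naive row-wise version I could fall back on, using apply with collections.Counter. It produces the output above, but I expect it to be slow on a large frame, which is why I'm looking for something better:

```python
import pandas as pd
from collections import Counter

dict_example = {'col1': ['A', 'A', 'A', 'A', 'A'],
                'col2': ['A', 'B', 'A', 'B', 'A'],
                'col3': ['A', 'A', 'A', 'C', 'B']}
df = pd.DataFrame(dict_example)

def minority_cols(row):
    counts = Counter(row)
    if len(counts) == 1:          # all values identical: no minority
        return []
    least = min(counts.values())  # size of the smallest group
    return [col for col, val in row.items() if counts[val] == least]

df['min'] = df.apply(minority_cols, axis=1)
print(df)
```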

Finding the most-common item appears straightforward, either with pandas.DataFrame.mode(), or by finding the most frequent item in a list as follows:

lst = ['A', 'B', 'A']
max(lst, key=lst.count)

But I'm not sure how I could find the least-occurring items.
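Presumably the mirror image of the trick above would work for a flat list (min instead of max, or collections.Counter to see every count at once); it's extending this row-wise across the DataFrame that I'm unsure about:

```python
from collections import Counter

lst = ['A', 'B', 'A']

# mirror of the max() trick above
print(min(lst, key=lst.count))   # 'B'

# Counter exposes every count, which makes ties easy to inspect
counts = Counter(lst)
least = min(counts.values())
print([x for x in counts if counts[x] == least])  # ['B']
```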

EB2127

1 Answer


This solution is not simple, but I could not think of a pandas-native solution without apply, and numpy does not offer much help here beyond the complex-number trick below for row-wise uniqueness and value counts.


If you are not fixed on adding this min column, we can use some numpy tricks to NaN out the non-least-occurring entries. First, given your dataframe, we can build a numpy array of integer codes to work with.

v = pd.factorize(df.stack())[0].reshape(df.shape)

or

v = pd.factorize(df.values.flatten())[0].reshape(df.shape)

(which should be faster, as the stack is unnecessary)
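To illustrate with the example dataframe from the question: factorize assigns each distinct label an integer code in order of first appearance (A → 0, B → 1, C → 2), so v is a plain integer array of the same shape as df:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'A', 'A'],
                   'col2': ['A', 'B', 'A', 'B', 'A'],
                   'col3': ['A', 'A', 'A', 'C', 'B']})

v = pd.factorize(df.values.flatten())[0].reshape(df.shape)
print(v)
# [[0 0 0]
#  [0 1 0]
#  [0 0 0]
#  [0 1 2]
#  [0 0 1]]
```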

Then we can use a trick for row-wise unique elements in numpy: adding a distinct imaginary offset to each row makes equal values in different rows compare as different, so a single np.unique call yields per-row value counts, from which we find the least-occurring elements and mask out everything else. This method is adapted from several answers by user unutbu.

import numpy as np

def make_mask(a):
    # give each row a distinct imaginary offset so np.unique treats
    # equal values in different rows as different elements
    weight = 1j * np.arange(a.shape[0])
    b = a + weight[:, np.newaxis]
    # return_inverse maps every element back to its within-row count,
    # so a minority value that repeats is kept at every occurrence
    u, inv, c = np.unique(b, return_inverse=True, return_counts=True)
    counts = c[inv].reshape(a.shape).astype(float)
    mn = counts.min(axis=1, keepdims=True)
    mx = counts.max(axis=1, keepdims=True)
    # all values in the row identical (and repeated): no minority
    counts[(mn == mx) & (counts > 1)] = np.nan
    # keep only the least-occurring entries
    counts[counts != mn] = np.nan
    return counts

m = np.isnan(make_mask(v))
df[m] = np.nan

Giving

  col1 col2 col3
0  NaN  NaN  NaN
1  NaN    B  NaN
2  NaN  NaN  NaN
3    A    B    C
4  NaN  NaN    B

Hopefully this achieves what you want in a performant way (say, if this dataframe is quite large). If there is a faster way to compute v in the first line, I would imagine this approach scales well even for very large dataframes.
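If you do end up wanting the literal min column of names from your expected output, it can be read back off the boolean mask afterwards. A small self-contained sketch (for illustration, the mask m is rebuilt here from the expected output above rather than recomputed via make_mask):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'A', 'A'],
                   'col2': ['A', 'B', 'A', 'B', 'A'],
                   'col3': ['A', 'A', 'A', 'C', 'B']})

# m is True where the entry is NOT a minority (i.e. where df was NaN'd out);
# rebuilt here from the masked output shown above
masked = pd.DataFrame([[np.nan, np.nan, np.nan],
                       [np.nan, 'B',    np.nan],
                       [np.nan, np.nan, np.nan],
                       ['A',    'B',    'C'],
                       [np.nan, np.nan, 'B']], columns=df.columns)
m = masked.isna().values

# the non-masked column names in each row are exactly the minorities
df['min'] = [list(df.columns[~row]) for row in m]
print(df['min'].tolist())
```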

modesitt
    Thanks! It is a large dataframe, so this operation should be performant. I'll give it a shot. – EB2127 Apr 22 '20 at 01:30
  • Actually, it looks like I've found a situation whereby it's not working as expected. I have a row with 4 A and 2 B, and it's shows me that there are no "minority" characters. I could share the data with you perhaps? – EB2127 Apr 22 '20 at 04:59
  • yeah @EB2127 I can take a look – modesitt Apr 22 '20 at 15:14
    I'm not sure how to share the dataframe. The row is A, A, C, C, A, A – EB2127 Apr 22 '20 at 20:34