
This question provided an example of how per-row frequency counts can be obtained from a pandas DataFrame using pd.get_dummies plus aggregation (sketched below). However, this doesn't scale if you only want a small subset of terms from a very large DataFrame.

For example, consider the same data:

import pandas as pd

df = pd.DataFrame({'ID': ['xyz_1', 'xyz_2', 'xyz_3', 'xyz_4', 'xyz_400'],
                   'class1': ['yes_1', 'no_2', pd.NA, 'no_3', 'no_7'],
                   'class2': ['no_8', 'yes_15', 'yes_16', 'no_18', 'no_21'],
                   'class3': [pd.NA, 'no_51', 'yes_1', 'no_3', 'no_4'],
                   'class100': ['yes_3', 'no_5', pd.NA, 'yes_6', 'no_7']})

        ID class1  class2 class3 class100
0    xyz_1  yes_1    no_8   <NA>    yes_3
1    xyz_2   no_2  yes_15  no_51     no_5
2    xyz_3   <NA>  yes_16  yes_1     <NA>
3    xyz_4   no_3   no_18   no_3    yes_6
4  xyz_400   no_7   no_21   no_4     no_7
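
For reference, the pd.get_dummies + aggregation approach from the linked question is roughly of the following shape (my sketch of that idea, not the exact code from the linked answer):

# one-hot encode every categorical value, then sum dummy columns that refer to the same value
dummies = pd.get_dummies(df.set_index('ID'), prefix='', prefix_sep='')
per_row_counts = dummies.T.groupby(level=0).sum().T  # frequency of each value per row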

Instead of the values being limited to yes and no, each cell can hold one of many different categorical values. If you only want frequency counts for yes_1 and no_51, that is a ton of extra computation.

The best solution I've found so far is to pre-process all other values into NAs:

keep = ['yes_1', 'no_51']
df[~df.isin(keep)] = pd.NA  # mask every value that is not in the subset

     ID class1 class2 class3 class100
0  <NA>  yes_1   <NA>   <NA>     <NA>
1  <NA>   <NA>   <NA>  no_51     <NA>
2  <NA>   <NA>   <NA>  yes_1     <NA>
3  <NA>   <NA>   <NA>   <NA>     <NA>
4  <NA>   <NA>   <NA>   <NA>     <NA>

This is still glacially slow for large DataFrames on the order of 1 million entries. Is there a way to scale this better?
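
(For concreteness, the final counts can then be read off the masked frame with something along these lines; this last step is my assumption, since the question doesn't show it.)

df.stack().value_counts()  # stack() drops every <NA>, leaving only the kept terms
# yes_1    2
# no_51    1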

knowads
  • Have you tried [df.replace()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)? See if that gives you any improvement. In my experience, pandas and numpy functions scale well. If you still need more performance and the DataFrame can be broken up temporarily, you might look into [multiprocessing](https://docs.python.org/3/library/multiprocessing.html). – wifl Oct 09 '20 at 23:50
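
A replace-based version of the masking step might look roughly like this (the regex pattern and the exact df.replace call are my own sketch, not something spelled out in the comment):

import numpy as np

keep = ['yes_1', 'no_51']
# replace any string that is not exactly one of the kept values with NaN; non-string cells are left alone
pattern = r'^(?!(?:' + '|'.join(keep) + r')$).*$'
df_masked = df.replace(to_replace=pattern, value=np.nan, regex=True)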

3 Answers


I don't know whether it works better than your technique or not, but I propose it as a solution to test:

import numpy as np

(
    pd.melt(df, id_vars=['ID'])                                       # long format: one row per (ID, class, value)
    .assign(yes_1=lambda x: np.where(x['value'] == 'yes_1', 1, 0))    # indicator column for 'yes_1'
    .assign(no_51=lambda x: np.where(x['value'] == 'no_51', 1, 0))    # indicator column for 'no_51'
    [['yes_1', 'no_51']]                                              # keep only the indicator columns
    .sum()                                                            # total count of each term
)
Mehdi Golzadeh
  • Thanks for the contribution, but ideally I'd like to black-box whatever the set of acceptable values is. – knowads Oct 10 '20 at 01:35

For the problem in the linked question, a far more performant solution would be:

df.apply(lambda row: row.value_counts(dropna=False), axis=1).fillna(0)

This might already be sufficient for your purposes; however, if you're only after a couple of values, it might be faster to just:

counts = pd.Series({key: (df == key).values.sum() for key in ['yes_1', 'no_51']})  # dict comprehension, not a set, so the labels are kept
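
With the sample frame above, this should give something like:

yes_1    2
no_51    1
dtype: int64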
Marat
import numpy as np

df.set_index('ID', inplace=True)           # make ID the index
df[~df.isin(['yes_1', 'no_51'])] = np.nan  # set anything not in the set to NaN
pd.get_dummies(df.stack().unstack())       # stack/unstack drops IDs and columns that are entirely NaN

      

       class1_yes_1  class3_no_51  class3_yes_1
ID                                             
xyz_1             1             0             0
xyz_2             0             1             0
xyz_3             0             0             1
wwnde