
My ultimate goal is one-hot-encoding on a Pandas column. In this case, I want to one-hot-encode column "b" as follows: keep apples, bananas and oranges, and encode any other fruit as "other".

Example: in the code below "grapefruit" will be re-written as "other", as would "kiwi"s and "avocado"s if they appeared in my data.

The code below works:

import pandas as pd

df = pd.DataFrame({
    "a": [1,2,3,4,5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})
print(df)

def analyze_fruit(s):
    if s in ("apple", "banana", "orange"):
        return s
    else:
        return "other"

df['b'] = df['b'].apply(analyze_fruit)

df2 = pd.get_dummies(df['b'], prefix='b')
print(df2)

My question: is there a shorter way to do the analyze_fruit() business? I tried DataFrame.replace() with a negative lookahead assertion without success.

smci
jsf80238
  • `~df['b'].isin(['apple','banana','orange'])` is useful – smci Jul 31 '21 at 23:56
  • Duplicate of the linked question. Another way is [`sklearn.preprocessing.OneHotEncoder(..., drop=None)`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) also generates dummies. Just first, munge your data to create the "other" categorical level/column. – smci Aug 01 '21 at 00:15

1 Answer


You can set up the Categorical before calling get_dummies. Anything that does not match the set categories becomes NaN, which can then be filled with fillna. Another benefit of the Categorical is that ordering can be defined here as well: the order of the categories list determines the order of the dummy columns, and ordered=True can be added to make the categories comparable:

df['b'] = pd.Categorical(
    df['b'],
    categories=['apple', 'banana', 'orange', 'other']
).fillna('other')

df2 = pd.get_dummies(df['b'], prefix='b')
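
As a minimal sketch of the ordering point, the snippet below reuses the question's fruit data with a deliberately unusual category order. The order chosen here, the variable names `fruit`, `cats`, `encoded`, and `dummies`, and the `dtype=int` argument are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

fruit = pd.Series(["apple", "banana", "banana", "orange", "grapefruit"])

# Hypothetical ordering chosen for illustration: "other" first, then reverse alphabetical.
cats = ["other", "orange", "banana", "apple"]

# Values outside `cats` become NaN, which fillna then maps to "other".
encoded = pd.Series(pd.Categorical(fruit, categories=cats)).fillna("other")

# dtype=int keeps 0/1 output regardless of the pandas version's default.
dummies = pd.get_dummies(encoded, prefix="b", dtype=int)
print(list(dummies.columns))
```

The dummy columns should follow the order of `cats` rather than alphabetical order, which is the practical reason to build the Categorical first.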

A standard replacement with something like np.where also works here, although dummies are typically built from Categorical data, where the dummy columns can be made to appear in a set order:

import numpy as np

df['b'] = np.where(df['b'].isin(['apple', 'banana', 'orange']),
                   df['b'],
                   'other')

df2 = pd.get_dummies(df['b'], prefix='b')
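
A pandas-only variant of the same replacement might look like the sketch below. It combines the isin mask with Series.where, which keeps values where the mask is True and substitutes the given value elsewhere. The `keep` list name and the `dtype=int` argument are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"b": ["apple", "banana", "banana", "orange", "grapefruit"]})
keep = ["apple", "banana", "orange"]

# Keep values where the mask is True; replace everything else with "other".
df["b"] = df["b"].where(df["b"].isin(keep), "other")

# dtype=int keeps 0/1 output regardless of the pandas version's default.
df2 = pd.get_dummies(df["b"], prefix="b", dtype=int)
print(df2)
```

This avoids the NumPy import and keeps the result as a Series, at the cost of losing the explicit category list that the Categorical approach provides.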

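Both produce df2 (shown with 0/1 values, as in pandas versions before 2.0; newer versions return True/False unless dtype=int is passed to get_dummies):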

   b_apple  b_banana  b_orange  b_other
0        1         0         0        0
1        0         1         0        0
2        0         1         0        0
3        0         0         1        0
4        0         0         0        1
Henry Ecker
  • HenryEcker: yes sorry my brain sprang a leak. Yes this is correct. – smci Aug 01 '21 at 00:05
  • By the way, if you read in column 'b' from pd.read_csv, you can directly read the fruit column 'b' as a Categorical (then each fruit will get its own categorical level). You then postprocess it to munge the forbidden fruit together into one "other" level. – smci Aug 01 '21 at 00:10
  • Actually, [`sklearn.preprocessing.OneHotEncoder(..., drop=None)`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) also generates dummies. – smci Aug 01 '21 at 00:13
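
The read_csv suggestion in the comments could be sketched roughly as follows. The in-memory CSV text, the `keep` list, and the use of `cat.set_categories` followed by fillna are assumptions made for illustration; the commenter did not specify how to do the postprocessing.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = "a,b\n1,apple\n2,banana\n3,banana\n4,orange\n5,grapefruit\n"

# Read column "b" directly as a categorical; each fruit gets its own level.
df = pd.read_csv(io.StringIO(csv_text), dtype={"b": "category"})

keep = ["apple", "banana", "orange"]

# Levels missing from the new category list become NaN, then get filled with "other".
df["b"] = df["b"].cat.set_categories(keep + ["other"]).fillna("other")
print(df["b"].tolist())
```

From here, pd.get_dummies(df['b'], prefix='b') can be applied as in the answer.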