My ultimate goal is to one-hot encode a pandas column. In this case, I want to one-hot encode column "b" as follows: keep apples, bananas and oranges, and encode any other fruit as "other".
Example: in the code below, "grapefruit" is rewritten as "other", as would "kiwi" and "avocado" be if they appeared in my data.
The code below works:
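import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})
print(df)

def analyze_fruit(s):
    # Keep the three fruits of interest; collapse everything else into "other".
    if s in ("apple", "banana", "orange"):
        return s
    else:
        return "other"

df['b'] = df['b'].apply(analyze_fruit)
# One indicator column per remaining category: b_apple, b_banana, b_orange, b_other.
df2 = pd.get_dummies(df['b'], prefix='b')
print(df2)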
My question: is there a shorter way to do the analyze_fruit() business? I tried DataFrame.replace() with a negative lookahead assertion without success.
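For reference, here is a minimal sketch of the kind of one-liner I have in mind, assuming Series.isin() combined with Series.where() is an acceptable replacement for analyze_fruit(). The names keep and collapsed are placeholders of mine, not anything pandas defines, and I have not checked this against data containing missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})

# Hypothetical name: the categories to preserve as-is.
keep = ["apple", "banana", "orange"]

# Series.where keeps values where the condition is True and
# substitutes "other" wherever it is False.
collapsed = df["b"].where(df["b"].isin(keep), "other")

df2 = pd.get_dummies(collapsed, prefix="b")
print(df2)
```

This avoids the Python-level function call per row, but it may or may not count as "shorter" once the list of kept fruits is written out.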