I am implementing the following logic for feature engineering. A simple approach is easy to write, but I am wondering whether anyone can think of a more efficient solution. Ideas are appreciated even if you don't feel like implementing the whole thing!
Take this DataFrame and dictionary:
import pandas as pd

random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']}
)
cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}
We want to create a column/feature for each string in the dictionary AND for each category: 1 if the string is contained in the description column, 0 otherwise. A category column is 1 if any of that category's strings is contained in the description.
The output for this toy example would look like:
  description  is_dog  is_cat  is_horse  is_donkey  is_category_a  is_category_b
0       xdogx       1       0         0          0              1              0
1       xcatx       0       1         0          0              1              0
2   xhamsterx       0       0         0          0              0              0
3       xdogx       1       0         0          0              1              0
4     xhorsex       0       0         1          0              0              1
5    xdonkeyx       0       0         0          1              0              1
6       xcatx       0       1         0          0              1              0
A simple approach would be to iterate once for each required output column and run the following (is_dog is hardcoded here for simplicity):
random_animals['is_dog'] = random_animals['description'].str.contains('dog')*1
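For reference, a minimal sketch of that simple approach generalised to the whole dictionary might look like the following. It assumes the strings should be matched literally (hence `regex=False`) and that a category flag is simply the maximum of its word flags. Note that the columns come out grouped by category rather than in the order shown in the table above.

```python
import pandas as pd

random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']}
)
cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}

for category, words in cat_dict.items():
    for word in words:
        # One pass over the description column per word;
        # regex=False treats the word as a literal substring.
        random_animals[f'is_{word}'] = (
            random_animals['description']
            .str.contains(word, regex=False)
            .astype(int)
        )
    # A category flag is 1 if any of its word flags is 1.
    word_cols = [f'is_{word}' for word in words]
    random_animals[f'is_{category}'] = random_animals[word_cols].max(axis=1)
```

This does one scan of the description column per string, which is the cost I would like to avoid if possible.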
There can be an arbitrary number of strings and categories in cat_dict, so I am wondering whether there is another way to do this.
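One possible direction, as a rough sketch only: scan the description column once with a combined pattern and expand the matches with `str.get_dummies`. The names `word_to_cat`, `word_flags`, `cat_flags` and `result` are made up for illustration. The sketch assumes that no string contains the `|` character and that no string is a substring of another, because `str.findall` only returns non-overlapping matches. Whether this is actually faster than a plain loop would need timing on real data.

```python
import re
import pandas as pd

random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']}
)
cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}

# Flatten the dictionary; assumes each string belongs to exactly one category.
word_to_cat = {word: cat for cat, words in cat_dict.items() for word in words}

# One alternation pattern covering every string, escaped so it matches literally.
pattern = '|'.join(re.escape(word) for word in word_to_cat)

# Single scan of the column: each row becomes a list of the strings it contains.
matches = random_animals['description'].str.findall(pattern)

# Turn the lists into 0/1 indicator columns; reindex so that strings which
# never occur still get an all-zero column, in dictionary order.
word_flags = (
    matches.str.join('|')
    .str.get_dummies(sep='|')
    .reindex(columns=list(word_to_cat), fill_value=0)
)

# A category flag is 1 if any of its string flags is 1.
cat_flags = pd.DataFrame(
    {cat: word_flags[words].max(axis=1) for cat, words in cat_dict.items()}
)

result = random_animals.join(
    pd.concat([word_flags, cat_flags], axis=1).add_prefix('is_')
)
```

Here the per-category step still loops over the categories, but it works on the small indicator frame rather than on the text column.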