Pandas create dummy features for each string in a dictionary of lists

Question

I am implementing the following logic for feature engineering. A simple approach is easy, but I am wondering whether there is a more efficient solution. Ideas are appreciated even if you don't feel like implementing the whole code!

Take this DataFrame and dictionary

import pandas as pd
random_animals = pd.DataFrame(
                {'description':['xdogx','xcatx','xhamsterx','xdogx'
                                ,'xhorsex','xdonkeyx','xcatx']
                })


cat_dict = {'category_a':['dog','cat']
            ,'category_b':['horse','donkey']}

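We want to create a column/feature for each string in the dictionary and for each category. A string's column is 1 if the string is contained in the description column and 0 otherwise; a category's column is 1 if any of its strings is contained in the description.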

So the output for this toy example would look like:

  description  is_dog is_cat is_horse is_donkey is_category_a is_category_b
0       xdogx       1      0        0         0             1             0
1       xcatx       0      1        0         0             1             0    
2   xhamsterx       0      0        0         0             0             0
3       xdogx       1      0        0         0             1             0
4     xhorsex       0      0        1         0             0             1
5    xdonkeyx       0      0        0         1             0             1
6       xcatx       0      1        0         0             1             0

A simple approach would be to iterate once per required output column and run the following (is_dog is hardcoded here for simplicity):

random_animals['is_dog'] = random_animals['description'].str.contains('dog')*1

There can be an arbitrary number of strings and categories in cat_dict, so I am wondering whether there is a better way to do this.

Similar to https://stackoverflow.com/questions/46786211/counting-the-frequency-of-words-in-a-pandas-data-frame ? — Kyle, May 24 '18 at 20:44
Not really, as shown in the example above, we want to be adding whole columns of 0/1, not just a count of the keywords. — user4505419, May 24 '18 at 20:49
There are shortcuts if you just want to test for `category_a` or `category_b`. With your problem, as stated, I do not believe you can optimise much further than `pd.Series.str.contains` (within `pandas` technology). — jpp, May 24 '18 at 20:59

Antonio Luis Sombra · Answer 1 · 2018-05-25T21:34:49.917

Interesting problem. I coded what you want below, but there's probably a shorter way to do it:

import numpy as np

# Create the DataFrame with columns of zeros
# (names are recovered by stripping the surrounding 'x' from each description)
names = [x[1:-1] for x in random_animals.description.unique()]
categories = list(cat_dict.keys())
columns = names + categories
df_names = pd.DataFrame(0, index=np.arange(len(random_animals)),
                        columns=columns)
df = pd.concat([random_animals, df_names], axis=1)

# Populate the DataFrame - automating your solution
for col in columns:
    if col in cat_dict:
        # For categories: match any of the category's strings
        searchfor = cat_dict[col]
        df[col] = df['description'].str.contains('|'.join(searchfor)) * 1
    else:
        # For animal names
        df[col] = df['description'].str.contains(col) * 1

# Finally, rename the name columns from "dog" to "is_dog", and so on
df = df.rename(columns={name: 'is_' + name for name in names})

score 2 · Accepted Answer · answered May 25 '18 at 21:50

Here is a vectorized method. The main observation is that random_animals.description.str.contains when applied to a string returns a Series of indicators, one for each row of random_animals.

Since random_animals.description.str.contains is itself a vectorized function, we can apply it to the collection of animals to obtain a full indicator matrix.

Finally, we can add categories by enforcing logic between different columns. This will likely be faster than checking for string inclusion multiple times.

import pandas as pd
random_animals = pd.DataFrame(
                {'description':['xdogx','xcatx','xhamsterx','xdogx'
                                ,'xhorsex','xdonkeyx','xcatx']
                })


cat_dict = {'category_a':['dog', 'cat']
            ,'category_b':['horse', 'donkey']}

# create a Series containing all individual animals
# (assumes no animal is listed under more than one category)
animals = pd.Series([animal for v in cat_dict.values()
                     for animal in v])

df = pd.DataFrame(
        animals.apply(random_animals.description.str.contains).T.values,
        index  = random_animals.description,
        columns = animals).astype(int)

# use a separate loop variable so the `animals` Series is not overwritten
for cat, members in cat_dict.items():
    df[cat] = df[members].any(axis=1).astype(int)

#              dog  cat  horse  donkey  category_a  category_b
# description
# xdogx          1    0      0       0           1           0
# xcatx          0    1      0       0           1           0
# xhamsterx      0    0      0       0           0           0
# xdogx          1    0      0       0           1           0
# xhorsex        0    0      1       0           0           1
# xdonkeyx       0    0      0       1           0           1
# xcatx          0    1      0       0           1           0
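
The same idea can be wrapped in a small helper that produces the `is_`-prefixed columns requested in the question. This is only a minimal sketch: the function name `add_dummy_features` and its `text_col` and `prefix` parameters are made up for illustration, and it assumes the keywords should be matched as literal substrings (`regex=False`) with missing descriptions counted as no match (`na=False`).

```python
import pandas as pd


def add_dummy_features(frame, cat_dict, text_col="description", prefix="is_"):
    """Return a copy of `frame` with one 0/1 column per string and per category.

    Hypothetical helper, not part of pandas.
    """
    out = frame.copy()
    for category, members in cat_dict.items():
        for member in members:
            # regex=False matches the keyword literally, so characters such as
            # '.' or '+' are not interpreted as a pattern; na=False maps
            # missing descriptions to "no match".
            out[prefix + member] = (
                out[text_col].str.contains(member, regex=False, na=False).astype(int)
            )
        # A category is 1 when any of its member columns is 1, which reuses
        # the indicators above instead of searching the strings again.
        member_cols = [prefix + m for m in members]
        out[prefix + category] = out[member_cols].any(axis=1).astype(int)
    return out


random_animals = pd.DataFrame(
    {"description": ["xdogx", "xcatx", "xhamsterx", "xdogx",
                     "xhorsex", "xdonkeyx", "xcatx"]}
)
cat_dict = {"category_a": ["dog", "cat"],
            "category_b": ["horse", "donkey"]}

result = add_dummy_features(random_animals, cat_dict)
```

Note that each category column is placed right after its member columns, so the column order differs from the table in the question; reorder with `result[...]` if that matters.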

score 0 · Answer 3 · answered May 24 '18 at 21:39

You could extend the pandas DataFrame class and implement lazy column evaluation: if a derived column does not exist yet, compute it on first access and add it to the base class's columns collection.
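
A rough sketch of the lazy idea follows. To keep it short it wraps a DataFrame instead of subclassing it (a true subclass would also need to handle pandas' constructor hooks), so it illustrates the principle and is not the exact design described above. The class name `LazyDummies` and the `is_<name>` key convention are assumptions made for this example.

```python
import pandas as pd


class LazyDummies:
    """Hypothetical wrapper that computes 'is_<name>' columns on first access."""

    def __init__(self, frame, cat_dict, text_col="description"):
        self.frame = frame.copy()
        self.cat_dict = cat_dict
        self.text_col = text_col

    def __getitem__(self, key):
        # Only single string keys are derived lazily; the result is cached
        # as a real column so later accesses do no extra work.
        if isinstance(key, str) and key not in self.frame.columns:
            self.frame[key] = self._compute(key)
        return self.frame[key]

    def _compute(self, key):
        if not key.startswith("is_"):
            raise KeyError(key)
        name = key[len("is_"):]
        # A category matches when any of its strings matches;
        # any other name is treated as a single literal string.
        terms = self.cat_dict.get(name, [name])
        text = self.frame[self.text_col]
        hits = pd.Series(False, index=text.index)
        for term in terms:
            hits = hits | text.str.contains(term, regex=False, na=False)
        return hits.astype(int)


random_animals = pd.DataFrame(
    {"description": ["xdogx", "xcatx", "xhamsterx", "xdogx",
                     "xhorsex", "xdonkeyx", "xcatx"]}
)
cat_dict = {"category_a": ["dog", "cat"],
            "category_b": ["horse", "donkey"]}

lazy = LazyDummies(random_animals, cat_dict)
is_dog = lazy["is_dog"]                # computed now and cached
is_category_b = lazy["is_category_b"]  # computed now and cached
```

Only the columns that are actually requested get computed, which may help when `cat_dict` is large and most features are never used.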

SKG
