4

Beginner here. I want to use one hot encoding on my data frame that has multiple categorical data in one column. My data frame looks something like this, although with more things in the column such that I can't just do it manually:

Title       column
Movie 1   Action, Fantasy
Movie 2   Fantasy, Drama
Movie 3   Action
Movie 4   Sci-Fi, Romance, Comedy
Movie 5   NA
etc.

My desired output:

 Title     Action  Fantasy  Drama  Sci-Fi  Romance  Comedy
Movie 1     1       1        0      0        0       0
Movie 2     0       1        1      0        0       0
Movie 3     1       0        0      0        0       0
Movie 4     0       0        0      1        1       1
Movie 5     0       0        0      0        0       0  
etc.

Thanks!

razortight
  • 43
  • 4
  • do you know all possible values in the column (i.e. A, B, ...F) in advance? – Alena Volkova Jun 16 '20 at 13:06
  • Welcome to Stackoverflow! Please update your question with the desired behavior, specific problems, and code to reproduce it. See: How to create [a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/minimal-reproducible-example). – isAif Jun 16 '20 at 13:07
  • @AlenaVolkova Yes, I know the possible values in the column. – razortight Jun 16 '20 at 13:32
  • @HeisAif I'm not really sure what to put since the only thing I have so far is a data frame from a .csv file. My issue right now is how to use one hot encoding on my data frame. – razortight Jun 16 '20 at 13:48

2 Answers2

5

Considering the input data as:

import pandas as pd
data = {'Title': ['Movie 1', 'Movie 2', 'Movie 3', 'Movie 4', 'Movie 5'], 
        'column': ['Action, Fantasy', 'Fantasy, Drama', 'Action', 'Sci-Fi, Romance, Comedy', np.nan]}
df = pd.DataFrame(data)
df
    Title   column
0   Movie 1 Action, Fantasy
1   Movie 2 Fantasy, Drama
2   Movie 3 Action
3   Movie 4 Sci-Fi, Romance, Comedy
4   Movie 5 NaN

This code produces the desired output:

# treat null values
df['column'].fillna('NA', inplace = True)

# separate all genres into one list, considering comma + space as separators
genre = df['column'].str.split(', ').tolist()

# flatten the list
flat_genre = [item for sublist in genre for item in sublist]

# convert to a set to make unique
set_genre = set(flat_genre)

# back to list
unique_genre = list(set_genre)

# remove NA
unique_genre.remove('NA')

# create columns by each unique genre
df = df.reindex(df.columns.tolist() + unique_genre, axis=1, fill_value=0)

# for each value inside column, update the dummy
for index, row in df.iterrows():
    for val in row.column.split(', '):
        if val != 'NA':
            df.loc[index, val] = 1

df.drop('column', axis = 1, inplace = True)    
df
    Title   Action  Fantasy Comedy  Sci-Fi  Drama   Romance
0   Movie 1 1       1       0       0       0       0
1   Movie 2 0       1       0       0       1       0
2   Movie 3 1       0       0       0       0       0
3   Movie 4 0       0       1       1       0       1
4   Movie 5 0       0       0       0       0       0

UPDATE: I've added a null value into the test data, and treat it appropriately in the first line of the solution.

Daniel Labbe
  • 1,979
  • 3
  • 15
  • 20
  • This code works when I declare the data in the code but not for when I get my data from a csv file using this code: `df = pd.DataFrame(pd.read_csv('Sample.csv', names = ['Title', 'column']))` There's an error at the flat_genre line, saying that 'float' object is not iterable. – razortight Jun 17 '20 at 09:29
  • I see, your data contains a null value, not the string NA. The null value is represente as a numerical constant of type float in numpy. I've changed the test data to consider this situation and treat this in the solution. Please, check if it works now. – Daniel Labbe Jun 17 '20 at 10:51
1
### Import libraries and load sample data

import numpy as np
import pandas as pd

data = {
    'Movie 1': ['Action, Fantasy'],
    'Movie 2': ['Fantasy, Drama'],
    'Movie 3': ['Action'],
    'Movie 4': ['Sci-Fi, Romance, Comedy'],
    'Movie 5': ['NA'],
}

df = pd.DataFrame.from_dict(data, orient='index')
df.rename(columns={0:'column'}, inplace=True)

At this stage our DataFrame looks like this:

           column
Movie 1    Action, Fantasy
Movie 2    Fantasy, Drama
Movie 3    Action
Movie 4    Sci-Fi, Romance, Comedy
Movie 5    NA

Now, the question we're asking is - does a given genre word ("sub-string") occur in 'column' for a given movie?

To do this we'll first need a list of genre words:

### Join every string in every row, split the result, pull out the unique values.
genres = np.unique(', '.join(df['column']).split(', '))
### Drop 'NA'
genres = np.delete(genres, np.where(genres == 'NA'))

Depending on how large your dataset is, this could be computationally costly. You mentioned that you know the unique values already. So you could just define the iterable 'genres' manually.

Getting the OneHotVectors:

for genre in genres:
    df[genre] = df['column'].str.contains(genre).astype('int')

df.drop('column', axis=1, inplace=True)

We loop through each genre, we ask whether the genre exists in 'column', this returns a True or False, which is converted to 1 or 0 respectively - when we cast to type('int').

We end up with:

          Action    Comedy  Drama   Fantasy Romance Sci-Fi
Movie 1        1         0      0         1       0      0
Movie 2        0         0      1         1       0      0
Movie 3        1         0      0         0       0      0
Movie 4        0         1      0         0       1      1
Movie 5        0         0      0         0       0      0

Maaz Basar
  • 13
  • 3
  • This also works, but I'm also encountering an error here when I get my data from a csv file using `df = pd.DataFrame(pd.read_csv('Sample.csv', names = ['Title', 'column']))`. The error is at the line `genres = np.unique(', '.join(df['column']).split(', '))` and says "sequence item 4: expected str instance, float found". – razortight Jun 17 '20 at 09:37
  • @razortight That's probably because you have NaNs in your column. NaNs are of type float. You could do a ```df.fillna('NA', inplace=True)```. Also, you don't need to do a ```pd.DataFrame()``` on top of a ```pd.read_csv()``` - there's nothing technically "wrong" with it, it's just redundant and doesn't make any difference. – Maaz Basar Jun 17 '20 at 10:03