### Import libraries and load sample data
import numpy as np
import pandas as pd
data = {
    'Movie 1': ['Action, Fantasy'],
    'Movie 2': ['Fantasy, Drama'],
    'Movie 3': ['Action'],
    'Movie 4': ['Sci-Fi, Romance, Comedy'],
    'Movie 5': ['NA'],
}
df = pd.DataFrame.from_dict(data, orient='index')
df.rename(columns={0:'column'}, inplace=True)
At this stage our DataFrame looks like this:
                          column
Movie 1          Action, Fantasy
Movie 2           Fantasy, Drama
Movie 3                   Action
Movie 4  Sci-Fi, Romance, Comedy
Movie 5                       NA
Now, the question we're asking is: does a given genre word (a sub-string) occur in 'column' for a given movie?
To do this we'll first need a list of genre words:
### Join every string in every row, split the result, pull out the unique values.
genres = np.unique(', '.join(df['column']).split(', '))
### Drop 'NA'
genres = np.delete(genres, np.where(genres == 'NA'))
Depending on how large your dataset is, this could be computationally costly. You mentioned that you already know the unique values, so you could instead define the iterable 'genres' manually.
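As a minimal sketch of the manual route, assuming the six genres in the sample data are the complete vocabulary (that list is an assumption; substitute your own known values):

```python
# Assumption: these six genres are the full, known set of values.
# Any iterable of strings works; a plain list avoids the join/split/unique pass.
genres = ['Action', 'Comedy', 'Drama', 'Fantasy', 'Romance', 'Sci-Fi']
```

Keeping the list in alphabetical order gives the same column order that np.unique would produce.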
Getting the one-hot vectors:
for genre in genres:
    # regex=False treats the genre as a literal sub-string rather than a regex pattern
    df[genre] = df['column'].str.contains(genre, regex=False).astype('int')
df.drop('column', axis=1, inplace=True)
We loop through each genre and ask whether it occurs in 'column'. This returns True or False for every row, which becomes 1 or 0 respectively when we cast with astype('int').
We end up with:
         Action  Comedy  Drama  Fantasy  Romance  Sci-Fi
Movie 1       1       0      0        1        0       0
Movie 2       0       0      1        1        0       0
Movie 3       1       0      0        0        0       0
Movie 4       0       1      0        0        1       1
Movie 5       0       0      0        0        0       0
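For comparison, pandas also has a built-in that may give the same table without an explicit loop: Series.str.get_dummies. This is a different technique from the loop above, shown here only as a sketch on the same sample data. It assumes every row uses exactly ', ' as the separator, and the variable name one_hot is just illustrative.

```python
import pandas as pd

data = {
    'Movie 1': ['Action, Fantasy'],
    'Movie 2': ['Fantasy, Drama'],
    'Movie 3': ['Action'],
    'Movie 4': ['Sci-Fi, Romance, Comedy'],
    'Movie 5': ['NA'],
}
df = pd.DataFrame.from_dict(data, orient='index')
df.rename(columns={0: 'column'}, inplace=True)

# Split each string on ', ' and build one indicator column per distinct token.
one_hot = df['column'].str.get_dummies(sep=', ')

# The literal string 'NA' is treated as just another token, so drop its column.
one_hot = one_hot.drop(columns='NA', errors='ignore')

print(one_hot)
```

Because this splits on the separator first, it matches whole genre tokens rather than sub-strings, which can matter if one genre name is contained in another.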