I would like to prepare my dataset for use with machine learning algorithms. One of my features is a list of the tags associated with each TV series (my records). Is it possible to apply one-hot encoding directly to this column, or would it be preferable to first extract all the possible elements of these lists? I plan to use these tags in later analysis.
Here is an example of my dataset and the code applied to it.
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Attempt: index the "tags" column directly (each row holds a list of tags)
indexer = StringIndexer(inputCol="tags", outputCol="tagsIndex")
df = indexer.fit(df).transform(df)

# Then one-hot encode the resulting index column
ohe = OneHotEncoder(inputCol="tagsIndex", outputCol="tagsOHEVector")
df = ohe.fit(df).transform(df)
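
To make clear what I mean by "first extract all the possible elements", here is a minimal sketch in plain Python (not Spark). The series names, the tags, and the helper `multi_hot` are made up for illustration and are not taken from my real dataset. The idea is to collect every distinct tag first, then give each series one 0/1 entry per tag.

```python
# Hypothetical records: each TV series has a list of tags (made-up data).
series_tags = {
    "Series A": ["drama", "crime"],
    "Series B": ["comedy"],
    "Series C": ["crime", "comedy", "thriller"],
}

# Step 1: extract every distinct tag that occurs in any list.
vocabulary = sorted({tag for tags in series_tags.values() for tag in tags})


# Step 2: build one 0/1 entry per tag; a series may have several 1s.
def multi_hot(tags, vocabulary):
    """Return a 0/1 list aligned with `vocabulary` (illustrative helper)."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]


encoded = {name: multi_hot(tags, vocabulary) for name, tags in series_tags.items()}

for name, vector in encoded.items():
    print(name, vector)
```

Unlike a plain one-hot vector, each row here can contain more than one 1, because a series can carry several tags. On an array column in Spark, `CountVectorizer` from `pyspark.ml.feature` with `binary=True` is one possible way to get a comparable vector, but the sketch above does not use it.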