Let's say I have a dataframe and a list of words, i.e.
import pandas as pd

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible', 'You are good',
                            'you are bad and disguisting']})
main = pd.concat([df, pd.DataFrame(columns=toxic)]).fillna(0)
samp = main['text'].str.split().apply(lambda x: [i for i in toxic if i in x])
for i, j in enumerate(samp):
    for k in j:
        main.loc[i, k] = 1
This leads to:
bad disguisting horrible text
0 0 0 1 You look horrible
1 0 0 0 You are good
2 1 1 0 you are bad and disguisting
This is a bit faster than get_dummies, but for loops in pandas are not advisable when there is a huge amount of data.
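One loop-free variant I considered is a vectorized str.contains pass per toxic word (just a sketch; whole-word matching via a \b regex, and case handling identical to the split-based version, are assumptions):

import re

out = df.copy()
for w in toxic:
    # One vectorized pass over all rows per word; \b enforces whole-word matches.
    out[w] = df['text'].str.contains(rf'\b{re.escape(w)}\b').astype(int)

This still loops, but only over the short toxic list rather than over the rows.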
I tried str.get_dummies, but it one-hot encodes every word in the series, which makes it a bit slower.
pd.concat([df, main['text'].str.get_dummies(' ')[toxic]], axis=1)
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
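I also looked at restricting the encoding to only the toxic words with scikit-learn's CountVectorizer and a fixed vocabulary, so every other word is skipped during tokenization (a sketch; binary=True to get 0/1 indicators instead of counts is an assumption about the desired output):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=toxic, binary=True)
# With a fixed vocabulary, words outside `toxic` are simply ignored;
# feature columns follow the order of the vocabulary list.
dummies = pd.DataFrame(cv.fit_transform(df['text']).toarray(),
                       columns=toxic, index=df.index)
result = pd.concat([df, dummies], axis=1)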
If I try the same with scikit-learn:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(toxic)
main['text'].str.split().apply(le.transform)
This leads to ValueError: y contains new labels. Is there a way to ignore the error in scikit-learn?
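As far as I know, LabelEncoder.transform has no option to skip unseen labels, so a workaround sketch (assuming it is acceptable to drop words outside toxic) is to filter each token list against le.classes_ before transforming:

known = set(le.classes_)  # the labels seen during fit
main['text'].str.split().apply(
    lambda words: le.transform([w for w in words if w in known]))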
How can I improve the speed here? Is there any other fast way of doing the same thing?