Let's say I have a dataframe and a list of words, i.e.
import pandas as pd

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible', 'You are good',
                            'you are bad and disguisting']})
main = pd.concat([df, pd.DataFrame(columns=toxic)]).fillna(0)
samp = main['text'].str.split().apply(lambda x: [i for i in toxic if i in x])
for i, j in enumerate(samp):
    for k in j:
        main.loc[i, k] = 1
This leads to:
bad disguisting horrible text
0 0 0 1 You look horrible
1 0 0 0 You are good
2 1 1 0 you are bad and disguisting
This is a bit faster than get_dummies, but for loops in pandas are not advisable when there is a huge amount of data.
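One loop-free variant I considered is a vectorized str.contains pass per toxic word (just a sketch; whole-word matching via a \b regex, and case handling identical to the split-based version, are assumptions):

import re

out = df.copy()
for w in toxic:
    # One vectorized pass over all rows per word; \b enforces whole-word matches.
    out[w] = df['text'].str.contains(rf'\b{re.escape(w)}\b').astype(int)

This still loops, but only over the short toxic list rather than over the rows.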
I tried str.get_dummies, but it one-hot encodes every word in the series, which makes it a bit slower.
pd.concat([df, main['text'].str.get_dummies(' ')[toxic]], axis=1)
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
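I also looked at restricting the encoding to only the toxic words with scikit-learn's CountVectorizer and a fixed vocabulary, so every other word is skipped during tokenization (a sketch; binary=True to get 0/1 indicators instead of counts is an assumption about the desired output):

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=toxic, binary=True)
# With a fixed vocabulary, words outside `toxic` are simply ignored;
# feature columns follow the order of the vocabulary list.
dummies = pd.DataFrame(cv.fit_transform(df['text']).toarray(),
                       columns=toxic, index=df.index)
result = pd.concat([df, dummies], axis=1)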
If I try the same with scikit-learn:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(toxic)
main['text'].str.split().apply(le.transform)
This leads to ValueError: y contains new labels. Is there a way to ignore the error in scikit-learn?
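As far as I know, LabelEncoder.transform has no option to skip unseen labels, so a workaround sketch (assuming it is acceptable to drop words outside toxic) is to filter each token list against le.classes_ before transforming:

known = set(le.classes_)  # the labels seen during fit
main['text'].str.split().apply(
    lambda words: le.transform([w for w in words if w in known]))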
How can I improve the speed here? Is there any other fast way of doing the same thing?