Adding sparse matrix from CountVectorizer into dataframe with complimentary information for classifier - keep it in sparse format

Question

I have the following problem. Right now I am building a classifier system which will use text and some additional complimentary information as an input. I store complimentary information in pandas DataFrame. I transform text using CountVectorizer and get a sparse matrix. Now, in order to train a classifier I need to have both inputs in same dataframe. The problem is that, when I merge dataframe with output of CountVectorizer I get a dense matrix, which I means I run out of memory really fast. Is there any way to avoid it and properly merge together this 2 inputs into single dataframe without getting a dense matrix?

Example code:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

#how many most popular words we consider
n_features = 5000

df = pd.DataFrame.from_csv('DataWithSentimentAndTopics.csv',index_col=None)

#vecotrizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

#getting the TF matrix
tf = tf_vectorizer.fit_transform(df['reviewText'])

df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), pd.DataFrame(tf.A)], axis=1)

#binning target variable into 4 bins.
df['helpful'] = pd.cut(df['helpful'],[-1,0,10,50,100000], labels = [0,1,2,3])


#creating X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']

#splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)


#creating GBR
gbc = GradientBoostingClassifier(max_depth = 7, n_estimators=1500, min_samples_leaf=10)

print('Training GBC')
print(datetime.datetime.now())
#fit classifier, look for best
gbc.fit(X_train, y_train)

As you see, I set up my CountVectorizer to have 5000 words. I have just 50000 lines in my original dataframe but I already get a matrix of 50000x5000 cells, which is 2.5 billion of units. It already requires a lot of memory.

[Consider using `SparseSeries` as DataFrame columns](http://stackoverflow.com/a/41916672/5741205) — MaxU - stand with Ukraine, Apr 23 '17 at 23:29
Hey @MaxU, I tried this and it didn't work. Get a following error: shape mismatch: value array of shape (683,) could not be broadcast to indexing result of shape (1,49999) — Maksim Khaitovich, Apr 24 '17 at 03:05
could you provide a small (5-10 features) reproducible sample data set - this would help us to build a working example — MaxU - stand with Ukraine, Apr 24 '17 at 08:15
@MaxU here is a small data set which you just need to save in the source file which is loaded into df: https://pastebin.com/WyXhPk1g — Maksim Khaitovich, Apr 24 '17 at 16:47

score 7 · Answer 1 · answered Apr 24 '17 at 12:05

you dont need to use a data frame.

convert the numerical features from dataframe to a numpy array:

num_feats = df[[cols]].values

from scipy import sparse

training_data = sparse.hstack((count_vectorizer_features, num_feats))

then you can use a scikit-learn algorithm which supports sparse data.

for GBM, you can use xgboost which supports sparse.

score 3 · Accepted Answer · answered Apr 24 '17 at 20:48

As @AbhishekThakur has already said, you don't have to put your one-hot-encoded data into the DataFrame.

But if you want to do so, you can add Pandas.SparseSeries as a columns:

#vecotrizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')

#getting the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))

# adding "features" columns as SparseSeries
for i, col in enumerate(tf_vectorizer.get_feature_names()):
    df[col] = pd.SparseSeries(tf[:, i].toarray().ravel(), fill_value=0)

Result:

In [107]: df.head(3)
Out[107]:
        asin  price      reviewerID  LenReview                  Summary  LenSummary  overall  helpful  reviewSentiment         0  \
0  151972036   8.48  A14NU55NQZXML2        199  really a difficult read          23        3        2          -0.7203  0.002632
1  151972036   8.48  A1CSBLAPMYV8Y0         77                      wha           3        4        0          -0.1260  0.005556
2  151972036   8.48  A1DDECXCGHDYZK        114       wordy and drags on          18        1        4           0.5707  0.004545

   ...    think  thought  trailers  trying  wanted  words  worth  wouldn  writing  young
0  ...        0        0         0       0       1      0      0       0        0      0
1  ...        0        0         0       1       0      0      0       0        0      0
2  ...        0        0         0       0       1      0      1       0        0      0

[3 rows x 78 columns]

Pay attention at memory usage:

In [108]: df.memory_usage()
Out[108]:
Index               80
asin               112
price              112
reviewerID         112
LenReview          112
Summary            112
LenSummary         112
overall            112
helpful            112
reviewSentiment    112
0                  112
1                  112
2                  112
3                  112
4                  112
5                  112
6                  112
7                  112
8                  112
9                  112
10                 112
11                 112
12                 112
13                 112
14                 112
                  ...
parts               16   # memory used: # of ones multiplied by 8 (np.int64)
peter               16
picked              16
point               16
quick               16
rating              16
reader              16
reading             24
really              24
reviews             16
stars               16
start               16
story               32
tedious             16
things              16
think               16
thought             16
trailers            16
trying              16
wanted              24
words               16
worth               16
wouldn              16
writing             24
young               16
dtype: int64

Thank you! Works nicely. And the error with broadcasting was coming from the fact that one of my columns ('price') was same as one of the words in TfIdfVectorizer output — Maksim Khaitovich, Apr 25 '17 at 01:05
For people who land here in the future: new versions of pandas (>1.0.0) do not have `SparseSeries` anymore. Instead, these need to be constructed with something like `pd.arrays.SparseArray()`. — vtasca, Nov 22 '22 at 14:28

score 2 · Answer 3 · answered Mar 30 '20 at 08:47

2

Pandas also supports importing sparse matrices, which it stores using its sparseDtype

import scipy.sparse    
pd.DataFrame.sparse.from_spmatrix(Your_Sparse_Data)

Which you could concatenate to the rest of your dataframe

answered Mar 30 '20 at 08:47

rorance_

349
1
10

Adding sparse matrix from CountVectorizer into dataframe with complimentary information for classifier - keep it in sparse format

3 Answers3