
I'm classifying spam from a list of email texts (stored in CSV format), but before I can do this, I want to get some simple count statistics from the output. I used CountVectorizer from sklearn as a first step and implemented it with the following code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

#import data from csv

spam = pd.read_csv('spam.csv')
# encode the label: 1 where the 'Spam' column says 'spam', 0 otherwise
spam['Spam'] = np.where(spam['Spam'] == 'spam', 1, 0)

#split data

X_train, X_test, y_train, y_test = train_test_split(spam['text'], spam['Spam'], random_state=0)

# convert the text features to a document-term count matrix,
# then to a dense array and back to per-document token lists
cv = CountVectorizer()
x_traincv = cv.fit_transform(X_train)
a = x_traincv.toarray()
a_list = cv.inverse_transform(a)

The output is stored both as a matrix (named 'a') and as a list of arrays (named 'a_list'), which looks like this:

[array(['do', 'I', 'off', 'text', 'where', 'you'], 
       dtype='<U32'),
 array(['ages', 'will', 'did', 'driving', 'have', 'hello', 'hi', 'hol', 'in', 'its', 'just', 'mate', 'message', 'nice', 'off', 'roads', 'say', 'sent', 'so', 'started', 'stay'], dtype='<U32'),      
       ...
 array(['biz', 'for', 'free', 'is', '1991', 'network', 'operator', 'service', 'the', 'visit'], dtype='<U32')]

but I found it a little difficult to get some simple count statistics from these outputs, such as the longest/shortest token, the average token length, and so on. How can I get these simple count stats from the matrix or list output that I generated?

  • Is this what you are looking for? https://stackoverflow.com/a/16078639/2491761 – tony_tiger Aug 21 '17 at 19:38
  • Nope, CountVectorizer().vocabulary_ will automatically compile (maybe I shouldn't use this term) a frequency for each term. I want to get the terms with the longest and shortest lengths. I currently use max_len = len(max(cv.vocabulary_, key=len)) and [word for word in cv.vocabulary_ if len(word) == max_len]. Does anyone have a better solution? – Chris T. Aug 21 '17 at 19:43
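
For reference, a minimal sketch of the vocabulary_-based approach described in that comment, assuming cv is the CountVectorizer already fitted in the question's code:

# assumes `cv` is the CountVectorizer fitted on X_train above
vocab = list(cv.vocabulary_)  # the learned tokens (keys of the vocabulary dict)

max_len = len(max(vocab, key=len))                 # longest token length
min_len = len(min(vocab, key=len))                 # shortest token length
avg_len = sum(len(w) for w in vocab) / len(vocab)  # average token length

longest_tokens = [w for w in vocab if len(w) == max_len]
shortest_tokens = [w for w in vocab if len(w) == min_len]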

1 Answer


You can load the tokens, token counts, and token lengths into a new Pandas dataframe, then do your custom queries.

Here is a simple example with a toy data set.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish","dog cat cat","fish bird walrus monkey","bird lizard"]

cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
# tokens and their total counts: https://stackoverflow.com/a/16078639/2491761
# (on newer scikit-learn versions, use cv.get_feature_names_out() instead)
tokens_and_counts = zip(cv.get_feature_names(), np.asarray(cv_fit.sum(axis=0)).ravel())

df = pd.DataFrame(tokens_and_counts, columns=['token', 'count'])

df['length'] = df.token.str.len() # https://stackoverflow.com/a/29869577/2491761

# all the tokens with length equal to min token length:
df.loc[df['length'] == df['length'].min(), 'token']

# all the tokens with length equal to max token length:
df.loc[df['length'] == df['length'].max(), 'token']

# all tokens with length less than mean token length:
df.loc[df['length'] < df['length'].mean(), 'token']

# all tokens with length greater than 1 standard deviation from the mean:
df.loc[df['length'] > df['length'].mean() + df['length'].std(), 'token']

This can easily be extended if you want to do queries based on the counts.
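
For example (a sketch using the same df and columns defined above; the specific queries are just illustrations):

# a few illustrative count-based queries on the same df as above

# the most frequent token(s):
df.loc[df['count'] == df['count'].max(), ['token', 'count']]

# tokens that appear only once in the corpus:
df.loc[df['count'] == 1, 'token']

# average occurrences per token and total token occurrences:
df['count'].mean(), df['count'].sum()

# tokens sorted by length, then by count, longest and most frequent first:
df.sort_values(['length', 'count'], ascending=False).head()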
