I'm classifying spam from a list of email texts (stored in a CSV file), but before I do that I want to get some simple count statistics from the vectorized output. As a first step I used CountVectorizer from sklearn, with the following code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
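# import data from csv (columns 'text' and 'target', as used in the split below)
spam_data = pd.read_csv('spam.csv')
# encode the label as 1 for spam and 0 otherwise
spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)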
#split data
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0)
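# learn the vocabulary and build the sparse document-term count matrix
cv = CountVectorizer()
x_traincv = cv.fit_transform(X_train)
# dense matrix of counts: one row per document, one column per token
a = x_traincv.toarray()
# list of arrays: the tokens with a nonzero count in each document
a_list = cv.inverse_transform(a)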
The output is stored either as a matrix (named 'a') or as a list of arrays (named 'a_list'). The list looks like this:
[array(['do', 'I', 'off', 'text', 'where', 'you'],
dtype='<U32'),
array(['ages', 'will', 'did', 'driving', 'have', 'hello', 'hi', 'hol', 'in', 'its', 'just', 'mate', 'message', 'nice', 'off', 'roads', 'say', 'sent', 'so', 'started', 'stay'], dtype='<U32'),
...
array(['biz', 'for', 'free', 'is', '1991', 'network', 'operator', 'service', 'the', 'visit'], dtype='<U32')]
However, I'm finding it difficult to get simple count statistics from these outputs, such as the longest and shortest tokens or the average token length. How can I compute these statistics from the matrix or the list that I generated?
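To make the goal concrete, here is a minimal sketch of the kind of summary I'm after, written in plain Python against two of the arrays shown above. The function name `token_stats` and the keys it returns are my own placeholders, not part of sklearn, and I'm assuming the statistics should be taken over unique tokens rather than over every occurrence.

```python
def token_stats(token_lists):
    """Summarise token lengths over the unique tokens in all documents.

    token_lists: an iterable of iterables of strings, for example the
    a_list returned by cv.inverse_transform (assumed usage).
    """
    # Collect each distinct token once, across every document.
    vocab = {str(tok) for doc in token_lists for tok in doc}
    if not vocab:
        raise ValueError("no tokens found")

    # Sort first so that ties on length are broken alphabetically.
    ordered = sorted(vocab)
    lengths = [len(tok) for tok in ordered]

    return {
        "n_tokens": len(ordered),
        "longest": max(ordered, key=len),
        "shortest": min(ordered, key=len),
        "mean_length": sum(lengths) / len(lengths),
    }


# Two of the per-document token arrays from the output above,
# written as plain lists so the sketch runs without sklearn.
sample = [
    ['do', 'I', 'off', 'text', 'where', 'you'],
    ['biz', 'for', 'free', 'is', '1991', 'network', 'operator',
     'service', 'the', 'visit'],
]

stats = token_stats(sample)
print(stats)
```

If this is a reasonable direction, I would expect `token_stats(a_list)` to work on the real output as well, since it only iterates over the arrays and takes `len` of each token, but I don't know whether there is a more direct way to do this from the vectorizer itself.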