This tag is for questions on the process of turning a collection of text documents into numerical feature vectors using the class CountVectorizer from Python's scikit-learn library.
Questions tagged [countvectorizer]
347 questions
18 votes, 2 answers
Sklearn: adding lemmatizer to CountVectorizer
I added lemmatization to my CountVectorizer, as explained on this scikit-learn page.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl =…

Rens
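The callable-tokenizer pattern this question refers to can be sketched as follows. This is only an illustration: to stay self-contained it assumes a hypothetical toy lookup table (`TOY_LEMMAS`) in place of NLTK's `WordNetLemmatizer`, and a plain regex in place of `word_tokenize`.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"cats": "cat", "dogs": "dog", "running": "run", "ran": "run"}


class ToyLemmaTokenizer:
    """Callable tokenizer: split a document into word tokens, then map each to its lemma."""

    def __call__(self, doc):
        tokens = re.findall(r"\w+", doc)
        return [TOY_LEMMAS.get(tok, tok) for tok in tokens]


# token_pattern is unused once a custom tokenizer is supplied, so it is set to None.
lemma_vectorizer = CountVectorizer(tokenizer=ToyLemmaTokenizer(), token_pattern=None)
X_lemma = lemma_vectorizer.fit_transform(["Cats ran", "The dog is running"])
lemma_vocab = sorted(lemma_vectorizer.vocabulary_)
```

Lowercasing runs before the tokenizer is called, so the lookup table only needs lowercase keys.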
17 votes, 3 answers
How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?
I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing.
My…

Kevin Markham
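One way this might be wired up is sketched below, assuming the text column is passed as a 2-D, single-column input (a one-column DataFrame such as `df[["text"]]` has the same shape). `SimpleImputer` returns a 2-D array while `CountVectorizer` expects a 1-D sequence of strings, so the sketch flattens in between with `FunctionTransformer(np.ravel)`. The sample data and the fill value `"missing"` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Illustrative single text column with one missing value, shape (3, 1).
text_column = np.array([["good movie"], [np.nan], ["bad movie"]], dtype=object)

text_pipeline = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),  # 2-D in, 2-D out
    FunctionTransformer(np.ravel),                             # flatten to 1-D
    CountVectorizer(),                                         # expects 1-D text
)
X_text = text_pipeline.fit_transform(text_column)
imputed_vocab = sorted(text_pipeline.named_steps["countvectorizer"].vocabulary_)
```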
14 votes, 2 answers
List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer
I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example
'and' 123 times, 'to' 100 times, 'for' 90 times,…

user1506145
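A minimal sketch of one way to get such a listing: sum the columns of the document-term matrix and pair each total with its term through the fitted `vocabulary_` mapping. The three sample documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["to be and to do", "to be or not", "and so on and on"]
freq_vectorizer = CountVectorizer()
X_counts = freq_vectorizer.fit_transform(docs)

# Column sums give the total number of occurrences of each term in the corpus.
totals = np.asarray(X_counts.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index; sort by descending count, then by term.
term_counts = sorted(
    ((term, int(totals[idx])) for term, idx in freq_vectorizer.vocabulary_.items()),
    key=lambda pair: (-pair[1], pair[0]),
)
```

The most frequent entries at the top of `term_counts` are natural stop-word candidates.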
11 votes, 2 answers
CountVectorizer does not print vocabulary
I have installed Python 2.7, NumPy 1.9.0, SciPy 0.15.1 and scikit-learn 0.15.2.
Now when I do the following in Python:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
            "We can see the shining sun,…

Archana
10 votes, 1 answer
Empty vocabulary for single letter by CountVectorizer
I am trying to convert a string into a numeric vector:
### Clean the string
def names_to_words(names):
    print('a')
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    print('b')
    return words
### Vectorization
def Vectorizer():
    …

LookIntoEast
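The excerpt is truncated, but the title points to a likely cause: the default `token_pattern` of `CountVectorizer` only matches tokens of two or more word characters, so documents made only of single letters produce an empty vocabulary. A sketch of the behaviour and one possible workaround, using made-up input:

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["a b c", "a d"]

# With the default pattern r"(?u)\b\w\w+\b" every one-character token is dropped,
# which leaves nothing to build a vocabulary from.
try:
    CountVectorizer().fit(names)
    default_failed = False
except ValueError:
    default_failed = True

# Allowing tokens of one or more word characters keeps single letters.
single_char_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X_single = single_char_vectorizer.fit_transform(names)
single_char_vocab = sorted(single_char_vectorizer.vocabulary_)
```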
10 votes, 2 answers
sklearn partial fit of CountVectorizer
Does CountVectorizer support partial fit?
I would like to train the CountVectorizer using different batches of data.

Donbeo
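`CountVectorizer` does not provide a `partial_fit` method, because its vocabulary is learned from the full corpus. A stateless alternative that is often suggested is `HashingVectorizer`, which maps tokens to columns with a hash function and therefore needs no fitting. The sketch below is a substitute technique, not incremental training of a `CountVectorizer`; the batch contents and `n_features` value are illustrative.

```python
from sklearn.feature_extraction.text import HashingVectorizer

batches = [
    ["the sky is blue", "the sun is bright"],
    ["the sun in the sky is bright"],
]

# No vocabulary is stored, so each batch can be transformed independently.
# alternate_sign=False and norm=None keep the values as plain token counts.
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
batch_matrices = [hasher.transform(batch) for batch in batches]
```

The trade-off is that hashed columns cannot be mapped back to terms, and distinct tokens may collide in the same column.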
9 votes, 1 answer
Scala Spark - split vector column into separate columns in a Spark DataFrame
I have a Spark DataFrame with a column of Vector values. The vectors are all n-dimensional, i.e., they all have the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each of which corresponds to one element in the…

Logan Yang
9 votes, 4 answers
Apply CountVectorizer to column with list of words in rows in Python
I wrote a preprocessing step for text analysis that removes stop words and applies stemming, like this:
test[col] = test[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])
train[col] =…

Yury Wallet
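When each row already holds a list of tokens, one option is to pass a callable as `analyzer`, since a callable analyzer receives the raw item and returns its features. A minimal sketch with made-up, already-stemmed token lists:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each "document" is already a list of (stemmed) tokens.
token_lists = [["good", "movi"], ["bad", "movi", "bad"]]

# An identity analyzer skips lowercasing and tokenization and counts the tokens as given.
pretokenized_vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X_tokens = pretokenized_vectorizer.fit_transform(token_lists)
```

With a pandas column of lists, the column itself could be passed in place of `token_lists`.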
7 votes, 1 answer
Pyspark - Sum over multiple sparse vectors (CountVectorizer Output)
I have a dataset with ~30k unique documents that were flagged because they have a certain keyword in them. Some of the key fields in the dataset are document title, filesize, keyword, and excerpt (50 words around keyword). Each of these ~30k unique…

Derek Jedamski
6 votes, 1 answer
Reduce Dimension of word-vectors from TfidfVectorizer / CountVectorizer
I want to use the TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get a vector representation of my terms. That means, I want a vector for a term where the documents are the features. That's simply the transpose of a TF-IDF…

Highchiller
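The setup described here, terms as rows and documents as features, can be sketched by transposing the TF-IDF matrix. `TruncatedSVD` is used below as one possible reduction step because it accepts sparse input; that choice, the toy corpus, and `n_components=2` are assumptions, since the excerpt does not say which method the asker had in mind.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana", "banana cherry", "cherry apple date", "date egg apple"]

tfidf = TfidfVectorizer()
doc_term = tfidf.fit_transform(docs)  # shape: (n_docs, n_terms)
term_doc = doc_term.T                 # shape: (n_terms, n_docs)

# Reduce each term's document-feature vector to 2 dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
term_vectors = svd.fit_transform(term_doc)
```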
6 votes, 1 answer
How to get CountVectorizer feature_names in order that they are set, not alphabetical?
I am trying to vectorize some data using
sklearn.feature_extraction.text.CountVectorizer.
This is the data that I am trying to vectorize:
corpus = [
    'We are looking for Java developer',
    'Frontend developer with knowledge in SQL and Jscript',
    …

nedzad
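By default the fitted vocabulary is sorted alphabetically. One way to control column order is to pass an explicit `vocabulary` list, whose order is then used for the column indices. The sketch below builds that list in order of first appearance using the vectorizer's own analyzer; it uses only the two corpus lines visible in the excerpt.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "We are looking for Java developer",
    "Frontend developer with knowledge in SQL and Jscript",
]

# Tokenize with the default analyzer and keep each token the first time it is seen.
analyzer = CountVectorizer().build_analyzer()
first_seen = list(dict.fromkeys(tok for doc in corpus for tok in analyzer(doc)))

# A fixed vocabulary list assigns column indices in list order.
ordered_vectorizer = CountVectorizer(vocabulary=first_seen)
X_ordered = ordered_vectorizer.fit_transform(corpus)
columns = sorted(ordered_vectorizer.vocabulary_, key=ordered_vectorizer.vocabulary_.get)
```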
6 votes, 1 answer
dimension mismatch error in CountVectorizer MultinomialNB
Before posting this question, I should say that I have thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them solved my problem.
Ok, so I split my 'spam email' text data…

Chris T.
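The excerpt is cut off before the failing code, so the cause here is an assumption. A common source of a dimension mismatch is fitting a second vectorizer on the test data, which yields a different number of columns than the classifier was trained on. The usual pattern is sketched below with made-up messages: fit on the training text once, then reuse the same fitted vectorizer for the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["win money now", "cheap money offer", "meeting at noon", "lunch at noon"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["cheap money", "noon meeting tomorrow"]

nb_vectorizer = CountVectorizer()
X_train = nb_vectorizer.fit_transform(train_texts)  # learns the vocabulary
X_test = nb_vectorizer.transform(test_texts)        # reuses it; unseen words are ignored

clf = MultinomialNB().fit(X_train, train_labels)
predictions = clf.predict(X_test)
```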
6 votes, 1 answer
How to preserve punctuation marks in Scikit-Learn text CountVectorizer or TfidfVectorizer?
Is there any way to preserve the punctuation marks !, ?, " and ' from my text documents using CountVectorizer or TfidfVectorizer parameters in scikit-learn?

Suhairi Suhaimin
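One possible approach is to extend `token_pattern` so that each of those marks counts as its own token. The pattern below is an illustrative sketch: it keeps the default word pattern and adds an alternative that matches a single `!`, `?`, `"` or `'`. The same parameter exists on `TfidfVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Wow! Is it "real"?', "It's real!"]

# Default word pattern, plus an alternative matching one of the four punctuation marks.
punct_vectorizer = CountVectorizer(token_pattern=r"""(?u)\b\w\w+\b|[!?"']""")
X_punct = punct_vectorizer.fit_transform(docs)
punct_vocab = sorted(punct_vectorizer.vocabulary_)
```

Single-character words such as the "s" in "It's" are still dropped, because the word part of the pattern requires at least two characters.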
5 votes, 1 answer
Encoding text in ML classifier
I am trying to build an ML model. However, I am having difficulty understanding where to apply the encoding.
Please see below the steps and functions to replicate the process I have been following.
First I split the dataset into train and test:
#…

LdM
5 votes, 1 answer
Lemmatization on CountVectorizer doesn't remove Stopwords
I'm trying to add lemmatization to CountVectorizer from scikit-learn, as follows:
import nltk
from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from…

ambigus9
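The excerpt is truncated, so the actual cause is not visible. One approach that avoids any mismatch between a stop-word list and lemmatized tokens is to filter stop words inside the custom tokenizer, after lemmatizing. The sketch below assumes hypothetical toy tables (`TOY_LEMMAS`, `TOY_STOP_WORDS`) in place of `pattern.es` and NLTK's stop-word corpus.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for a real lemmatizer and stop-word list.
TOY_LEMMAS = {"houses": "house", "was": "be", "is": "be"}
TOY_STOP_WORDS = {"the", "be"}


def lemmatize_then_filter(doc):
    """Lemmatize each token first, then drop lemmas that are stop words."""
    lemmas = [TOY_LEMMAS.get(tok, tok) for tok in re.findall(r"\w+", doc)]
    return [lemma for lemma in lemmas if lemma not in TOY_STOP_WORDS]


stop_vectorizer = CountVectorizer(tokenizer=lemmatize_then_filter, token_pattern=None)
X_filtered = stop_vectorizer.fit_transform(["The houses", "The house was red"])
filtered_vocab = sorted(stop_vectorizer.vocabulary_)
```

Filtering after lemmatization means the stop list only needs to contain lemmas, not every inflected form.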