Questions tagged [countvectorizer]

This tag is for questions on the process of turning a collection of text documents into numerical feature vectors using the class CountVectorizer from Python's scikit-learn library.
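
As a rough illustration of the process the tag describes, here is a pure-Python sketch of turning documents into count vectors. It is not scikit-learn's implementation. The helper names (`tokenize`, `fit_vocabulary`, `transform`) and the two sample sentences are hypothetical. The sketch assumes CountVectorizer's documented defaults: lowercasing, a token pattern that keeps only tokens of two or more word characters, and an alphabetically sorted vocabulary.

```python
import re
from collections import Counter

# Assumption: same regular expression as CountVectorizer's default
# token_pattern, which keeps tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document and extract tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(doc.lower())


def fit_vocabulary(docs):
    """Map each distinct term to a column index, in alphabetical order."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Build one dense row of term counts per document."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # Dicts preserve insertion order, so columns follow the sorted terms.
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows


docs = ["The sky is blue.", "The sun is bright."]
vocabulary = fit_vocabulary(docs)
matrix = transform(docs, vocabulary)

print(vocabulary)
print(matrix)
```

Under this token pattern, single-character tokens such as "a" are dropped, so a document made up only of single letters yields no terms at all. The real class returns a SciPy sparse matrix rather than nested lists.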

347 questions
18 votes
2 answers

Sklearn: adding lemmatizer to CountVectorizer

I added lemmatization to my CountVectorizer, as explained on this Sklearn page: from nltk import word_tokenize from nltk.stem import WordNetLemmatizer class LemmaTokenizer(object): def __init__(self): self.wnl =…
Rens
17 votes
3 answers

How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing. My…
14 votes
2 answers

List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example 'and' 123 times, 'to' 100 times, 'for' 90 times,…
11 votes
2 answers

CountVectorizer does not print vocabulary

I have installed Python 2.7, NumPy 1.9.0, SciPy 0.15.1 and scikit-learn 0.15.2. Now when I do the following in Python: train_set = ("The sky is blue.", "The sun is bright.") test_set = ("The sun in the sky is bright.", "We can see the shining sun,…
Archana
10 votes
1 answer

Empty vocabulary for single letter by CountVectorizer

Trying to convert a string into a numeric vector: ### Clean the string def names_to_words(names): print('a') words = re.sub("[^a-zA-Z]"," ",names).lower().split() print('b') return words ### Vectorization def Vectorizer(): …
LookIntoEast
10 votes
2 answers

sklearn partial fit of CountVectorizer

Does CountVectorizer support partial fit? I would like to train the CountVectorizer using different batches of data.
Donbeo
9 votes
1 answer

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame with a column of Vector values. The vectors are all n-dimensional, i.e. all of the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the…
Logan Yang
9 votes
4 answers

Apply CountVectorizer to column with list of words in rows in Python

I wrote a preprocessing step for text analysis that removes stopwords and stems the words, like this: test[col] = test[col].apply( lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words]) train[col] =…
Yury Wallet
7 votes
1 answer

Pyspark - Sum over multiple sparse vectors (CountVectorizer Output)

I have a dataset with ~30k unique documents that were flagged because they have a certain keyword in them. Some of the key fields in the dataset are document title, filesize, keyword, and excerpt (50 words around keyword). Each of these ~30k unique…
6 votes
1 answer

Reduce Dimension of word-vectors from TfidfVectorizer / CountVectorizer

I want to use the TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get a vector representation of my terms. That is, I want a vector for each term where the documents are the features. That's simply the transpose of a TF-IDF…
6 votes
1 answer

How to get CountVectorizer feature_names in order that they are set, not alphabetical?

I am trying to vectorize some data using sklearn.feature_extraction.text.CountVectorizer. This is the data that I am trying to vectorize: corpus = [ 'We are looking for Java developer', 'Frontend developer with knowledge in SQL and Jscript', …
6 votes
1 answer

dimension mismatch error in CountVectorizer MultinomialNB

Before I post this question, I have to say I've thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them worked for me. Ok, so I split my 'spam email' text data…
Chris T.
6 votes
1 answer

How to preserve punctuation marks in Scikit-Learn text CountVectorizer or TfidfVectorizer?

Is there any way to preserve the punctuation marks !, ?, " and ' in my text documents using CountVectorizer or TfidfVectorizer parameters in scikit-learn?
5 votes
1 answer

Encoding text in ML classifier

I am trying to build an ML model. However, I am having difficulty understanding where to apply the encoding. Please see below the steps and functions to replicate the process I have been following. First I split the dataset into train and test: #…
LdM
5 votes
1 answer

Lemmatization on CountVectorizer doesn't remove Stopwords

I'm trying to add lemmatization to CountVectorizer from scikit-learn, as follows: import nltk from pattern.es import lemma from nltk import word_tokenize from nltk.corpus import stopwords from sklearn.feature_extraction.text import CountVectorizer from…
ambigus9