Use this tag for questions about feature extraction with the DictVectorizer class from Python's scikit-learn library, which converts lists of feature-value mappings (dicts) into numeric feature vectors.
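As a rough illustration of the mapping this class performs, here is a pure-Python sketch (not scikit-learn's implementation): string values become one-hot `key=value` columns, numeric values keep their key as the column name, columns are sorted by name, and features not seen while building the vocabulary are ignored. The function names `fit_vocabulary` and `transform` and the sample records are hypothetical.

```python
def fit_vocabulary(records):
    """Collect sorted column names: 'key=value' for strings, 'key' for numbers."""
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def transform(records, feature_names):
    """Encode each dict as a dense row; features missing from the vocabulary are skipped."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                name, weight = f"{key}={value}", 1.0
            else:
                name, weight = key, float(value)
            if name in index:
                row[index[name]] = weight
        rows.append(row)
    return rows


# Hypothetical sample data
records = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
]
feature_names = fit_vocabulary(records)
matrix = transform(records, feature_names)
```

In scikit-learn itself the equivalent steps are `DictVectorizer().fit_transform(records)`, which returns a sparse matrix by default.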
Questions tagged [dictvectorizer]
23 questions
4
votes
3 answers
AttributeError: 'Pipeline' object has no attribute 'partial_fit'
I am trying to train my binary classifier on a huge dataset. Previously, I could train it using sklearn's fit method. But now I have more data and I cannot cope with it. I am trying to fit on it partially but couldn't get rid of…

kntgu
4
votes
3 answers
How to encode categorical features in sklearn?
I have a dataset with 41 features [columns 0 to 40], of which 7 are categorical. This categorical set is divided into two subsets:
A subset of string type (the column-features 1, 2, 3)
A subset of int type, in binary form 0 or 1 (the…

Gil
4
votes
2 answers
How can I encode features with more than one value per column? MultiDictVectorizer needed?
I am vectorizing some features in sklearn, and I have run into a problem. DictVectorizer works well if your data can be encoded into one dict key per item. What if your items can have two or more values of the same column? For instance,…

rjurney
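For the multi-valued case raised in this question, one possible approach (a sketch, not taken from the answers) is to expand each multi-valued column into one indicator key per value before vectorizing, so that every key again maps to a single number. The helper name `expand_multivalued` and the sample record are hypothetical.

```python
def expand_multivalued(record, sep="="):
    """Flatten list/tuple/set values into one 'key=value' indicator per element."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, tuple, set)):
            for element in value:
                flat[f"{key}{sep}{element}"] = 1
        else:
            # Single-valued entries are passed through unchanged.
            flat[key] = value
    return flat


# Hypothetical record: two values in the same column plus one numeric feature.
record = {"genre": ["comedy", "drama"], "year": 2001}
flat_record = expand_multivalued(record)
```

A list of dicts flattened this way can then be passed to `DictVectorizer().fit_transform(...)` as usual.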
3
votes
1 answer
How to use Scikit Learn dictvectorizer to get encoded dataframe from dense dataframe in Python?
I have a dataframe as follows:
   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0
From this I want to create an encoded dataset (for fastFM) as follows:
user1 user2 user4 user4…

exAres
1
vote
0 answers
How to solve a model-fitting shape error with DictVectorizer?
I'm working on a POS tagging problem and using a LogisticRegressionCV model to solve it. I extracted features of words and vectorized them with DictVectorizer(). However, I'm getting an error while fitting the model. After the model.fit part, the console…

Nilay Yilmaz
1
vote
2 answers
Is it possible to create an equivalent "restrict" method for CountVectorizer as is available for DictVectorizer in Scikit-learn?
For DictVectorizer it is possible to subset the object by using the restrict() method. Here is an example where I have explicitly listed the features to retain by using a boolean array.
import numpy as np
v = DictVectorizer()
D = [{'foo': 1,…

Billy Franks
1
vote
1 answer
Python sklearn MultinomialNB: Dimension mismatch using DictVectorizer
I'm trying to use MultinomialNB, but I get "ValueError: dimension mismatch".
I'm using DictVectorizer for the training data and LabelEncoder for the class.
This is my code:
def create_token(inpt):
    return inpt.split(' ')
def tok_freq(inpt):
    tok =…

jted95
1
vote
0 answers
Converting vectors of varying length to a fixed length (NLP)
Recently I have been reading about natural language processing, its vectorization methods, and the advantages of each vectorizer.
I am interested in character-level vectorization, but it seems like the main concern about the character vectorizer for each word…

Isaac Sim
1
vote
5 answers
using DictVectorizer to convert strings
satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years  dept       salary
0.38                0.53             2               157                   3                   0              1     0                      TECHNICAL  low
0.8                 0.86             5               262                   6                   0              1     0                      …

Vineeth
1
vote
1 answer
Converting string data to float before passing to SVM classifier
I have a dataset as follows:
X_data =
BankNum | ID |
00987772 | AB123 |
00987772 | AB123 |
00987772 | AB123 |
00987772 | ED245 |
00982123 | GH564 |
And another one as:
y_data =
ID | Labels
AB123 | High
ED245 | Low
GH564 | Low
I'm…

Xavier
1
vote
1 answer
Why would DictVectorizer change the number of features?
I have a dataset of 324 rows and 35 columns. I split it into training and testing data:
X_train, X_test, y_train, y_test = train_test_split(tempCSV[feaure_names[0:34]], tempCSV[feaure_names[34]], test_size=0.2, random_state=32)
This seems to…

Nicholas Hassan
1
vote
0 answers
Different results when using pd.get_dummies() and DictVectorizer() with categorical variables
I have a problem when I try to use categorical variables in a pipeline.
pd.get_dummies() is a terrific tool, but we cannot use it directly in a pipeline. So I had to use DictVectorizer(). I do it as below (toy example):
import numpy as np
import pandas as…

Edward
1
vote
1 answer
Categorical variables in pipeline: dimension mismatch
I am trying to build a pipeline with categorical variables:
import numpy as np
import pandas as pd
import sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import linear_model
from sklearn.pipeline import Pipeline
df =…

Edward
1
vote
1 answer
N-gram vectorization: what should I do with a new token that does not exist in the corpus?
I'm building a custom n-gram vectorizer for a bag-of-words model. I'm curious: what should I do if, while vectorizing a short text, I find a new token that does not exist in the corpus vocabulary? Should it just be skipped, or what?

Ph0en1x
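One common convention for the question above, and the behaviour scikit-learn's own vectorizers follow at transform time, is to skip tokens that are not in the fitted vocabulary, since no column exists for them. The sketch below illustrates that choice for word bigrams; the function names and the toy corpus are hypothetical, and whitespace tokenization is an assumption made for brevity.

```python
from collections import Counter


def word_ngrams(text, n=2):
    """Split on whitespace and return the list of word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(corpus, n=2):
    """Map every n-gram seen in the corpus to a column index (sorted order)."""
    grams = sorted({gram for doc in corpus for gram in word_ngrams(doc, n)})
    return {gram: i for i, gram in enumerate(grams)}


def vectorize(text, vocabulary, n=2):
    """Count known n-grams; n-grams missing from the vocabulary are skipped."""
    counts = Counter(word_ngrams(text, n))
    vector = [0] * len(vocabulary)
    for gram, count in counts.items():
        column = vocabulary.get(gram)
        if column is not None:
            vector[column] = count
    return vector


# Hypothetical toy corpus and a new text containing unseen bigrams
corpus = ["the cat sat", "the cat ran"]
vocabulary = build_vocabulary(corpus)
vector = vectorize("the cat flew away", vocabulary)
```

Skipping keeps the vector length fixed to the training vocabulary; an alternative some pipelines use is a dedicated "unknown" column, which is not shown here.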
1
vote
1 answer
Categorical variables in sklearn pipeline with DictVectorizer
I want to apply a pipeline with numeric & categorical variables as below
import numpy as np
import pandas as pd
from sklearn import linear_model, pipeline, preprocessing
from sklearn.feature_extraction import DictVectorizer
df =…

Edward