Questions tagged [dictvectorizer]

Use this tag for questions about feature extraction with the DictVectorizer class from Python's scikit-learn library, which turns lists of feature-name-to-value mappings (dicts) into numeric feature arrays.
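
For orientation, here is a minimal sketch of what DictVectorizer does. The records and their keys (`city`, `temperature`) are made up for illustration and do not come from any question below. String values are one-hot encoded as `key=value` columns, while numeric values pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: one dict per sample, mixing a categorical
# (string) feature with a numeric one.
records = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

# Column names, in column order (sorted alphabetically by default).
feature_names = vec.feature_names_

print(feature_names)
print(X)
```

Each distinct string value becomes its own column, so the number of output columns can exceed the number of input keys.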

23 questions
4
votes
3 answers

AttributeError: 'Pipeline' object has no attribute 'partial_fit'

I am trying to train my binary classifier on a huge dataset. Previously, I could train it using sklearn's fit method. But now I have more data and cannot cope with it. I am trying to fit it partially but couldn't get rid of…
kntgu
4
votes
3 answers

How to encode categorical features in sklearn?

I have a dataset with 41 features [columns 0 to 40], of which 7 are categorical. This categorical set is divided into two subsets: a subset of string type (column features 1, 2, 3) and a subset of int type, in binary form 0 or 1 (the…
4
votes
2 answers

How can I encode features with more than one value per column? MultiDictVectorizer needed?

I am vectorizing some features in sklearn, and I have run into a problem. DictVectorizer works well if your data can be encoded into one dict key per item. What if your items can have two or more values of the same column? For instance,…
3
votes
1 answer

How to use Scikit Learn dictvectorizer to get encoded dataframe from dense dataframe in Python?

I have a dataframe as follows:

   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0

From this I want to create an encoded dataset (for fastFM) as follows: user1 user2 user4 user4…
exAres
1
vote
0 answers

how to solve model fitting shape error dictVectorization?

I'm working on a POS tagging problem and using a LogisticRegressionCV model to solve it. I extracted features of words and vectorized them with DictVectorizer(). However, I'm getting an error while the model is fitting. After the model.fit part, the console…
1
vote
2 answers

Is it possible to create an equivalent "restrict" method for CountVectorizer as is available for DictVectorizer in Scikit-learn?

For DictVectorizer it is possible to subset the object by using the restrict() method. Here is an example where I have explicitly listed the features to retain by using a boolean array. import numpy as np v = DictVectorizer() D = [{'foo': 1,…
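
As a hedged illustration of the `restrict()` call mentioned in this excerpt: the sketch below uses its own made-up data (the keys `height`, `width`, `depth` are hypothetical) and is not the truncated example from the question. It fits a DictVectorizer, then keeps only the features selected by a boolean mask.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data, chosen only for illustration.
D = [{"height": 1, "width": 2}, {"height": 3, "depth": 1}]

v = DictVectorizer()
v.fit(D)
# After fitting, features are sorted: ['depth', 'height', 'width'].

# Boolean mask over the fitted features: drop 'depth', keep the rest.
support = np.array([False, True, True])
v.restrict(support)  # modifies the vectorizer in place

kept = v.feature_names_
X_restricted = v.transform(D).toarray()

print(kept)
print(X_restricted)
```

Note that `restrict()` changes the fitted vectorizer itself, so later `transform` calls only produce the retained columns.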
1
vote
1 answer

Python sklearn MultinomialNB: Dimension mismatch using DictVectorizer

I'm trying to use MultinomialNB. I got a ValueError: dimension mismatch. I'm using DictVectorizer for the training data and LabelEncoder for the class. This is my code: def create_token(inpt): return inpt.split(' ') def tok_freq(inpt): tok =…
jted95
1
vote
0 answers

Method of vectors in various vector length to fixed length (NLP)

Recently I have been reading about Natural Language Processing, its vectorization methods, and the advantages of each vectorizer. I am interested in character-level vectorization, but it seems like the main concern with a character vectorizer for each word…
Isaac Sim
1
vote
5 answers

using DictVectorizer to convert strings

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years dept salary 0.38 0.53 2 157 3 0 1 0 TECHNICAL low 0.8 0.86 5 262 6 0 1 0 …
1
vote
1 answer

Converting string data to float before passing to SVM classifier

I have a dataset as follows:

X_data =
BankNum  | ID    |
00987772 | AB123 |
00987772 | AB123 |
00987772 | AB123 |
00987772 | ED245 |
00982123 | GH564 |

And another one as:

y_data =
ID    | Labels
AB123 | High
ED245 | Low
GH564 | Low

I'm…
Xavier
1
vote
1 answer

Why would DictVectorizer change the number of features?

I have a dataset of 324 rows and 35 columns. I split it into training and testing data: X_train, X_test, y_train, y_test = train_test_split(tempCSV[feaure_names[0:34]], tempCSV[feaure_names[34]], test_size=0.2, random_state=32) This seems to…
Nicholas Hassan
1
vote
0 answers

Different results when using pd.get_dummies() and DictVectorizer() with categorical variables

I have a problem when I try to use categorical variables in a pipeline. pd.get_dummies() is a terrific tool, but we cannot use it directly in a pipeline. So I had to use DictVectorizer(). I do it as below (toy example): import numpy as np import pandas as…
Edward
1
vote
1 answer

Categorical variables in pipeline: dimension mismatch

I am trying to build a pipeline with categorical variables: import numpy as np import pandas as pd import sklearn from sklearn.base import BaseEstimator, TransformerMixin from sklearn import linear_model from sklearn.pipeline import Pipeline df =…
Edward
1
vote
1 answer

ngram vectorization - if new token found which not exists in corpus, what should I do with it

I'm building a custom n-gram vectorizer for a bag-of-words model. I'm curious: what should I do if, while vectorizing a short text, I find a new token that does not exist in the corpus vocabulary? Should it just be skipped, or what?
Ph0en1x
1
vote
1 answer

Categorical variables in sklearn pipeline with DictVectorizer

I want to apply a pipeline with numeric and categorical variables as below: import numpy as np import pandas as pd from sklearn import linear_model, pipeline, preprocessing from sklearn.feature_extraction import DictVectorizer df =…
Edward