Questions tagged [one-hot-encoding]

One-Hot Encoding is a method for converting categorical variables into numerical data that Machine Learning algorithms can process. It is most often used during feature engineering for an ML model. It creates a new binary column for each category value and assigns a 1 or 0 to each of those columns.

Often also called Dummy Encoding, One-Hot Encoding is suited to categorical variables that have no ordinal relationship between their values. It is the most widespread approach, and it works very well unless the categorical variable takes on a large number of unique values, because every unique value adds a column. One-Hot Encoding creates one new binary column for each possible value in the original data. For each row, the column matching that row's category holds a 1 and all the others hold a 0.
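As a rough illustration of the description above, here is a minimal pure-Python sketch of the idea. The function name `one_hot_encode` and the sample `colors` column are hypothetical and chosen only for this example; in practice, pandas' get_dummies() or sklearn's OneHotEncoder() would normally do this work.

```python
def one_hot_encode(values):
    """Return (categories, rows): one binary column per unique category."""
    # Sort the unique values so the column order is deterministic.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)  # start with all zeros
        row[index[value]] = 1        # mark the column for this row's category
        rows.append(row)
    return categories, rows


# Hypothetical sample column with three unique categories.
colors = ["red", "green", "blue", "green"]
categories, rows = one_hot_encode(colors)

print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, and the number of columns equals the number of unique values, which is why the approach becomes costly for columns with many distinct categories.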

1224 questions
2
votes
3 answers

How to One Hot Encode a Dataframe Column in Python?

I'm trying to convert a DataFrame column with OneHotEncoder using this code: from sklearn.preprocessing import OneHotEncoder df['label'] = OneHotEncoder().fit(df['label']).toarray() This is the traceback: ValueError: Expected 2D array, got 1D array…
2
votes
1 answer

How can I recode 53k unique addresses (saved as objects) w/o One-Hot-Encoding in Pandas?

My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories and my COLAB with (allegedly) TPU running won't…
Ryan
2
votes
1 answer

One Hot Encoder- Classification by categories

For model training, I have a vector with repeating values (numbers). I want to divide this vector into 10 different categories by number proximity (a kind of clustering) so that my output will be N * 10 (N is the number of values in the vector) of…
HELO
2
votes
2 answers

LabelEncoder().fit_transform gives me negative values?

Hi, I have different city names in the column "City" in my dataset. I would love to encode it using LabelEncoder(). However, I got quite frustrating results with negative values df['city_enc'] =…
2
votes
1 answer

"ValueError: A given column is not a column of the dataframe" when trying to convert categorical feature into numerical

I am using a csv file from a Udemy course for the sake of training. I only want to use age and country columns to keep things simple. Here is the code: import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.compose…
2
votes
0 answers

OneHotEncoder with pipeline and cross_val_score

I am doing the following: def make_trans(verbose=False): ct = ColumnTransformer( [ ('num', StandardScaler(), num_cols), ('cat', TestEncoder(), cat_cols) ], verbose=verbose ) return ct def…
ilya
2
votes
0 answers

How to deal with one hot encoding transformations for columns while predicting for new data?

I have made a model for predicting the yield of a crop on the basis of several features. It initially had 8 columns and after one hot encoding 4 of its columns it has a total of 813 columns. I saved the model and the encoder and I use the following…
Ravish Jha
2
votes
1 answer

Is this data representation here exact for One-Hot Encoding?

I am trying to encode the mushroom dataset here (https://www.kaggle.com/uciml/mushroom-classification) using One-Hot Encoding. Here is the code that I used (in Python) for the encoding: from sklearn.preprocessing import OneHotEncoder second_df =…
2
votes
1 answer

How to use get_dummies or one hot encoding to encode a categorical feature with multiple elements?

I'm working on a dataset which has a feature called categories. The data for each observation in that feature consists of a semicolon-delimited list, e.g. Rows categories Row 1 "categorya;categoryb;categoryc" Row 2 "categorya;categoryb" Row…
Jim Jones
2
votes
1 answer

One Hot Encoding a 2 categorical variable

For variables with two categories, do they need to be One Hot Encoded? In my dataset I have a binary variable as either 1 or 0. Do I need to transform that variable in a pipeline for my model or do I leave it as is? variable =…
Jack Armstrong
2
votes
1 answer

Can we use a numpy array as input to perform TfidfVectorizer() on text data, inside of make_column_transformer()?

I am trying to perform multiple column transformations using OneHotEncoder() and TfidfVectorizer() on my training data which is a numpy array. I am trying to use make_column_transformer() to perform all transformations at once. X_train is my input…
2
votes
1 answer

Multi-Feature One-Hot-Encoder with varying amount of feature instances

Let's assume we have data instances like this: [ [15, 20, ("banana","apple","cucumber"), ...], [91, 12, ("orange","banana"), ...], ... ] I am wondering how I can encode the third element of these datapoints. For multiple features values…
2
votes
1 answer

What is the difference between One Hot Encoding and pandas.categorical.code

I am working on some problem and have a doubt as below: In the data set there is a text column with following unique values: array(['1 bath', 'na', '1 shared bath', '1.5 baths', '1 private bath', '2 baths', '1.5 shared baths', '3 baths',…
2
votes
1 answer

Differences between OneHotEncoding (sklearn) and get_dummies (pandas)

I am wondering what the difference is between pandas' get_dummies() encoding of categorical features and sklearn's OneHotEncoder(). I've seen answers that mention that get_dummies() cannot produce encoding for categories not seen in…
2
votes
1 answer

How to deal with a target variable containing nominal data?

I'm working on an NLP project whose target variable contains seven unique sentences, which are "inspirational and thought-provoking ", "informative", "acknowledgment and appreciations" and 4 others. As for my understanding, the target variable as we…