Trouble with OneHotEncoding, Column Transformer and Linear Regression in Python

Question

I'm using a ColumnTransformer in my Python script to transform the categorical variable in a dataset for use in a linear regression model. I've applied OneHotEncoder to that variable, and the transformer appears to work correctly based on its output. However, when I try to fit a LinearRegression model on the transformed data, I get the error ValueError: could not convert string to float: 'New Hampshire'. I suspect the ColumnTransformer is not converting the categorical variable to a numerical format, but I'm not sure how to fix it. Any suggestions on how to resolve this error would be greatly appreciated.

Link to the dataset: https://www.kaggle.com/datasets/justin2028/unemployment-in-america-per-us-state

import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression   
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
  
data_set = pd.read_csv('Unemployment in America Per US State.csv')  
  
X = data_set.iloc[:, :-1]
y = data_set.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(sparse=False), [1])], remainder='passthrough')
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)
 
y_pred = regressor.predict(X_test)

I am using a column transformer to transform my data, and I have applied one-hot encoding to the categorical variable. I was expecting the categorical data to be transformed, but instead I get this error: ValueError: could not convert string to float.

score 0 · Answer 1 · answered Aug 07 '23 at 16:58

I think the error is because the table uses commas for thousands, meaning some numerical columns are stored as strings rather than as a numerical type. You can handle this by adding thousands=',' in the pd.read_csv(...) line:

data_set = pd.read_csv('Unemployment in America Per US State.csv', thousands=',')

After doing this, you'll find that all columns are of numerical type except one (see print(data_set.dtypes)). You already handle that remaining column correctly with the ColumnTransformer, so the error should no longer occur.
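To check this without downloading the dataset, here is a minimal sketch on a small made-up CSV. The column names and values are invented for illustration and only mimic the quoted, comma-separated numbers in the real file. It compares the dtypes with and without thousands=',' and then runs the same ColumnTransformer and LinearRegression steps as in the question. The sparse argument of OneHotEncoder is left out because its name differs between scikit-learn versions (sparse was renamed to sparse_output in 1.2).

```python
import io

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the Kaggle file (hypothetical columns and values):
# the numeric fields are quoted and use commas as thousands separators.
csv_text = '''FIPS Code,State/Area,Year,Population,Unemployed
1,Alabama,2020,"3,900,000","150,000"
33,New Hampshire,2020,"1,100,000","40,000"
1,Alabama,2021,"3,950,000","120,000"
33,New Hampshire,2021,"1,120,000","30,000"
'''

# Without thousands=',' the comma-formatted numbers are read as strings.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# With thousands=',' they are parsed as integers; only the state stays text.
fixed = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(fixed.dtypes)

X = fixed.iloc[:, :-1]
y = fixed.iloc[:, -1]

# One-hot encode the state column (position 1) and pass the rest through.
ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(), [1])],
    remainder='passthrough',
)
X_encoded = ct.fit_transform(X)

# Every column is numeric now, so the fit no longer hits a string value.
model = LinearRegression().fit(X_encoded, y)
```

If print(fixed.dtypes) still lists more than one non-numeric column on the real file, those extra columns would also need encoding or cleaning before the fit.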
