I am creating a decision tree on a dataset with 21 columns, a mix of numeric and categorical variables. As I understand it, sklearn's decision trees do not support categorical variables, so I converted the categorical columns to numeric using label encoding and kept the numeric columns separately. I assumed I would then have to combine both groups so I can split the data into training and test sets. However, when I tried to add the two together (the original numeric variables plus the categorical variables converted to numeric), I received a ValueError.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

credit = pd.read_csv('german_credit_risk.csv')
credit.head(10)

image of output

credit.info()

image of output

credit.describe(include='all')

image of output

col_names = ['Duration', 'Credit.Amount', 'Disposable.Income', 'Present.Residence', 'Age', 'Existing.Credits', 'Number.Liable', 'Cost.Matrix']
obj_cols = list(credit.select_dtypes(include='O').columns)
obj_cols

image of output

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

encoded_obj_df = pd.DataFrame(columns=obj_cols)

for col in obj_cols:
    encoded_obj_df[col] = le.fit_transform(credit[col])

encoded_obj_df.head(10)

image of output

credit.columns = col_names + encoded_obj_df

ValueError

Do I have the right idea and I'm just not adding the two together properly?

Please do **not** post screenshots of output and error messages; paste them here as *text* - see how to create a [mre]. – desertnaut Feb 06 '22 at 11:42

1 Answer

The error occurs because col_names + encoded_obj_df adds a list of strings to a DataFrame, and you then try to assign the result of that operation to the column names of another DataFrame. Instead, concatenate the two DataFrames (the numeric columns and the label-encoded columns) along axis 1 with the pd.concat function.
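As a minimal sketch of the pd.concat approach, assuming a small made-up frame in place of german_credit_risk.csv (the column values and the names num_cols and credit_encoded are illustrative, not taken from the question):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the credit data; the real CSV is not reproduced here.
credit = pd.DataFrame({
    "Duration": [6, 48, 12, 42],
    "Credit.Amount": [1169, 5951, 2096, 7882],
    "Purpose": ["radio/tv", "radio/tv", "education", "furniture"],
    "Housing": ["own", "own", "own", "free"],
})

# Treat every non-numeric column as categorical.
obj_cols = list(credit.select_dtypes(exclude="number").columns)
num_cols = [c for c in credit.columns if c not in obj_cols]

# Encode each categorical column with its own encoder,
# keeping the original row index so the frames line up.
encoded_obj_df = pd.DataFrame(index=credit.index)
for col in obj_cols:
    encoded_obj_df[col] = LabelEncoder().fit_transform(credit[col])

# Join the numeric and encoded columns side by side (axis=1 means column-wise).
credit_encoded = pd.concat([credit[num_cols], encoded_obj_df], axis=1)
print(credit_encoded.head())
```

The combined frame keeps one row per original record, so it can be passed to train_test_split directly.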

However, since you are using scikit-learn, I would advise using it to its full extent. Its Pipeline and ColumnTransformer classes can help with both preprocessing and classification.

A Pipeline chains a sequence of scikit-learn transformers and a final estimator, so you don't need to pass the data through each component yourself.

A ColumnTransformer selects subsets of columns and applies the given transformers to them. It then automatically combines the transformed columns (and any remaining ones, if remainder='passthrough' is set) into a single array.

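import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

# OrdinalEncoder is the feature-side counterpart of LabelEncoder.
# LabelEncoder only accepts a 1-D target array and fails inside a ColumnTransformer.
clf = make_pipeline(
    ColumnTransformer(
        # Encode the object (string) columns as integers
        [('categorical', OrdinalEncoder(), credit.select_dtypes(include='O').columns)],
        # Pass the numeric columns through unchanged
        remainder='passthrough'
    ),
    DecisionTreeClassifier()
)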

You can then use the standard clf.fit and clf.predict on the resulting pipeline and all of the data processing and prediction will happen at once.
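For illustration, here is a self-contained sketch of fitting and predicting with such a pipeline. It assumes a toy frame with a made-up target column named Risk, and it uses OrdinalEncoder for the categorical features; handle_unknown and unknown_value are set so that categories missing from the training split do not raise at prediction time. None of the data comes from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; "Risk" is an assumed target column name.
data = pd.DataFrame({
    "Duration": [6, 48, 12, 42, 24, 36, 24, 36],
    "Purpose": ["radio/tv", "radio/tv", "education", "furniture",
                "car", "education", "furniture", "car"],
    "Housing": ["own", "own", "own", "free", "free", "free", "own", "rent"],
    "Risk": [1, 0, 1, 1, 0, 1, 1, 0],
})

X = data.drop(columns="Risk")
y = data["Risk"]
cat_cols = list(X.select_dtypes(exclude="number").columns)

clf = make_pipeline(
    ColumnTransformer(
        [("categorical",
          # Unseen categories are mapped to -1 instead of raising an error.
          OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
          cat_cols)],
        remainder="passthrough",  # numeric columns are passed through unchanged
    ),
    DecisionTreeClassifier(random_state=0),
)

# Split the raw data first; the encoder is then fitted on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(predictions)
```

Splitting before fitting means the encoding is learned from the training rows alone, which avoids leaking information from the test set into preprocessing.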