I have a list of variables whose values are encoded in a way that throws Pandas off. For example, I have a column named "Alley" whose possible values include NA, which stands for "No Alley". However, Pandas interprets this as NaN. To get around this problem, I am encoding all NaN values with an arbitrary symbol like XX. These variables don't actually have null/missing values; their values are just being misinterpreted by Pandas. I am gathering them in a list:

na_data = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
           'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

And replacing each NaN reading with XX:

for i in na_data:
    df[i] = df[i].fillna('XX')

This was the old error I was getting:

Traceback (most recent call last):
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexes\base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: 'Alley'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Model.py", line 67, in <module>
    print(feature_encoding(train, categorical_columns))
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Model.py", line 50, in feature_encoding
    df[i] = df[i].fillna('XX')
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\pandas\core\frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\security\AppData\Roaming\Python\Python37\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: 'Alley'

The variable Alley definitely exists in the dataset! I copied and pasted the name from the dataset just for good measure.

This is my entire code (updated):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

categorical_columns = ['MSSubClass', 'MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'Neighborhood', 'Condition1',
                       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                       'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'PavedDrive', 'Fence',
                       'MiscFeature', 'SaleType', 'SaleCondition', 'Street', 'CentralAir', 'Utilities', 'ExterQual',
                       'LandSlope', 'ExterCond', 'HeatingQC', 'KitchenQual']

ranked_columns = ['Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                  'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond',
                  'PoolQC', 'OverallQual', 'OverallCond']

numerical_columns = ['LotArea', 'LotFrontage', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
                     'BsmtUnfSF','TotalBsmtSF', '1stFlrSF', '2ndFlrSf', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
                     'BsmtHalfBath', 'FullBath', 'HalfBath', 'Bedroom', 'Kitchen', 'TotRmsAbvGrd', 'Fireplaces',
                     'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
                     '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']

na_data = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
           'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

for i in na_data:
    train[i] = train[i].fillna('XX')

#Replaced the NaN values of LotFrontage and MasVnrArea with the mean of their column
train['LotFrontage'] = train['LotFrontage'].fillna(train['LotFrontage'].mean())
train['MasVnrArea'] = train['MasVnrArea'].fillna(train['MasVnrArea'].mean())

concatenated_list = categorical_columns + na_data

# take one-hot encoding
OHE_sdf = pd.get_dummies(train[concatenated_list])

# drop the old categorical column from original df
train.drop(columns = categorical_columns, axis = 1, inplace = True)

# attach one-hot encoded columns to original data frame
train = pd.concat([train, OHE_sdf], axis = 1, ignore_index = False)

x_train, x_test, y_train, y_test = train_test_split(train, train['SalePrice'], test_size = 0.3, random_state = 42)

sel = SelectFromModel(RandomForestClassifier(n_estimators = 100), threshold = "300*mean")
sel.fit(x_train, y_train)
sel.get_support()

selected_feat = x_train.columns[sel.get_support()]

print(selected_feat)

This is the new error:

Traceback (most recent call last):
  File "/home/onur/Documents/Boston-Kaggle/Model.py", line 49, in <module>
    sel.fit(x_train, y_train)
  File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/feature_selection/from_model.py", line 196, in fit
    self.estimator_.fit(X, y, **fit_params)
  File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/ensemble/forest.py", line 249, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "/opt/anaconda/envs/lib/python3.7/site-packages/sklearn/utils/validation.py", line 496, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/opt/anaconda/envs/lib/python3.7/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'XX'
Onur-Andros Ozbek

2 Answers

You're concatenating the data on the wrong axis:

df = pd.concat([df, OHE_sdf], axis = 1, ignore_index = True)
# Should be
df = pd.concat([df, OHE_sdf], axis = 0, ignore_index = True)

However, this will cause another error: you have already one-hot encoded some of the columns listed in na_data. For instance, GarageType has been encoded into multiple columns, one for each potential value, so it no longer exists and can't have its NaN values replaced.
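
As a hedged aside on the `KeyError: 'Alley'` itself: the traceback ends in `Int64Engine`, which suggests the column labels were integers by the time `df['Alley']` was looked up. One way that can happen is `ignore_index = True` combined with `axis = 1`, because the concatenation axis is then relabelled 0, 1, 2, and so on. The sketch below uses a made-up three-row frame; the data and the names `joined_relabelled` and `joined_named` are illustrative assumptions, not taken from the question.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration)
df = pd.DataFrame({"Alley": ["Grvl", None, "Pave"], "LotArea": [8450, 9600, 11250]})
dummies = pd.get_dummies(df[["Alley"]])

# ignore_index=True along axis=1 discards the column names and numbers them instead
joined_relabelled = pd.concat([df, dummies], axis=1, ignore_index=True)
print(list(joined_relabelled.columns))

# ignore_index=False keeps the original labels, so a lookup by name still resolves
joined_named = pd.concat([df, dummies], axis=1, ignore_index=False)
print(list(joined_named.columns))
```

If this is what happened, keeping `ignore_index = False` when attaching the encoded columns side by side would preserve the names.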

Edit:

I've updated several parts of the question code to ensure that it runs in its entirety.

First, we need to import all the libraries we will be using (note the addition of numpy):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
import numpy as np

Second, we need to get the data from the source:

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

Now we will remove all the NaNs from the data set:

# Create a series of how many NaNs are in each column
nanCounts = train.isna().sum()
# Find the total number of NaNs and print it (used to check that this bit is doing something)
nanTotal = nanCounts.sum()
print('NaN\'s found: ', nanTotal)

# Collect the names of the columns that contain at least one NaN
# (label-based selection avoids positional indexing into a Series with string labels)
nanCols = nanCounts[nanCounts > 0].index.tolist()

# Iterate through all the columns which are known to have NaNs
for i in nanCols:
    if train[i].dtype == 'float64':
        # If the column is of the data type float64 (a floating point number), fill NaNs with the mean of the column
        train[i] = train[i].fillna(train[i].mean())
    elif train[i].dtype == 'object':
        # If it's of the data type object (a text string), fill NaNs with XX
        train[i] = train[i].fillna('XX')

# Reprint the total number of NaNs
nanTotal = train.isna().sum().sum()
print('NaN\'s after removal: ', nanTotal)
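
An alternative worth considering, sketched here on a made-up CSV rather than the real file: `pd.read_csv` can be told which strings count as missing, so "NA" in a column like Alley can stay a real category instead of becoming NaN. The CSV text, the name `sample`, and the choice of LotFrontage as the one column where "NA" really means missing are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up CSV: "NA" in Alley is a real category ("No Alley"),
# while "NA" in LotFrontage is a genuinely missing number
csv_text = "Id,Alley,LotFrontage\n1,NA,65\n2,Grvl,NA\n3,Pave,80\n"

sample = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,              # turn off the built-in list of NaN strings
    na_values={"LotFrontage": ["NA"]},  # treat "NA" as missing only in this column
)

print(sample["Alley"].tolist())
print(sample["LotFrontage"].isna().sum())
```

With this approach the numeric columns that really do contain missing values have to be listed explicitly, otherwise their "NA" strings would turn the whole column into text.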

Now that there are no NaNs in the dataset, it is possible to assemble a list of the categorical data:

# Collect every column of the object datatype (text); these are the categorical features
categorical = train.select_dtypes(include='object').columns.tolist()
# Print out the list of categorical features
print('Categorical columns are: \n', categorical)

Now the code is very similar to the original with a few minor changes due to variable changes

# take one-hot encoding
OHE_sdf = pd.get_dummies(train[categorical])

# drop the old categorical column from original df
train.drop(columns = categorical, axis = 1, inplace = True)

# attach one-hot encoded columns to original data frame
train = pd.concat([train, OHE_sdf], axis = 1, ignore_index = False)

print('splitting dataset')
x_train, x_test, y_train, y_test = train_test_split(train, train['SalePrice'], test_size = 0.3, random_state = 42)

print('Selecting features')
# Note that here I changed the threshold so that it would actually show some features to use
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100), threshold = '1.25*mean')
sel.fit(x_train, y_train)
# Also save the boolean array directly; it is quicker and I prefer the formatting this way
selected = sel.get_support()

# Print the boolean array of selected features
print(selected)
# Print the finally selected features
print(train.columns[selected])
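
A note on reading that last printout, as a sketch on made-up data: `train.columns[selected]` is an `Index` of column names, so any dtype shown in its printed form describes the labels (which are strings), not the data stored in the selected columns. The `frame` and the hand-written `mask` below are hypothetical stand-ins for `train` and `sel.get_support()`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame standing in for the encoded training data
frame = pd.DataFrame(
    {"LotArea": [8450.0, 9600.0], "GrLivArea": [1710.0, 1262.0], "YrSold": [2008, 2007]}
)
# Hand-written boolean mask standing in for sel.get_support()
mask = np.array([True, False, True])

chosen = frame.columns[mask]
print(chosen)                # an Index of names; its dtype refers to the labels
print(frame[chosen].dtypes)  # the dtypes of the selected columns themselves
```

Checking `frame[chosen].dtypes` rather than the `Index` shows whether numeric columns were picked.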

All together it looks like

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
import numpy as np

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

nanCounts = train.isna().sum()
nanTotal = train.isna().sum().sum()
print('NaN\'s found: ', nanTotal)

nanCols = nanCounts[nanCounts > 0].index.tolist()

for i in nanCols:
    if train[i].dtype == 'float64':
        train[i] = train[i].fillna(train[i].mean())
    elif train[i].dtype == 'object':
        train[i] = train[i].fillna('XX')

nanTotal = train.isna().sum().sum()

print('NaN\'s after removal: ', nanTotal)

categorical = train.select_dtypes(include='object').columns.tolist()

print('Categorical columns are: \n', categorical)

# take one-hot encoding
OHE_sdf = pd.get_dummies(train[categorical])

# drop the old categorical column from original df
train.drop(columns = categorical, axis = 1, inplace = True)

# attach one-hot encoded columns to original data frame
train = pd.concat([train, OHE_sdf], axis = 1, ignore_index = False)

print('splitting dataset')
x_train, x_test, y_train, y_test = train_test_split(train, train['SalePrice'], test_size = 0.3, random_state = 42)

print('Selecting features')
sel = SelectFromModel(RandomForestClassifier(n_estimators = 100), threshold = '1.25*mean')
sel.fit(x_train, y_train)
selected = sel.get_support()

print(selected)
print(train.columns[selected])
Tasty213
  • So to get around that problem, I can just do this before one-hot encoding the categorical columns? – Onur-Andros Ozbek Aug 20 '19 at 22:42
  • Yes, that should solve the issue. I normally impute NaNs as early as possible, as they tend to disrupt most processing operations. – Tasty213 Aug 21 '19 at 08:25
  • I've just checked, and if you sort out the NaNs before the OHE you'll still need to change the axis to 0, but that shouldn't be an issue. However, it will refuse to run sel.fit() on the data, as XX is a string and not a float – Tasty213 Aug 21 '19 at 09:12
  • Yea but the columns of `na_data` don't have numerical values. For example: When I do `print(train['Alley'].unique().tolist())`, the output is: `[nan, 'Grvl', 'Pave']`. The `nan` being the `NA` value that Pandas is misinterpreting. So these columns do not have float values to begin with. – Onur-Andros Ozbek Aug 21 '19 at 11:50
  • Alley hasn't been encoded because it's not in the list of categorical variables; all categorical variables need to be encoded – Tasty213 Aug 21 '19 at 12:20
  • Take a look at the edit on the post. Right after replacing the `NA` values with `XX`, I am concatenating `na_data` with `categorical_columns`. Also, how can I OHE `Alley` if Pandas is misreading the `NA` value? – Onur-Andros Ozbek Aug 21 '19 at 12:40
  • You still need to drop `concatenated_list`, not just `categorical_columns` – Tasty213 Aug 21 '19 at 13:04
  • You're right. After fixing it, I'm getting this error: `ValueError: could not convert string to float: 'BrkFace'`, which means that my encoding of `NA` values as `XX` went through. – Onur-Andros Ozbek Aug 21 '19 at 13:06
  • Although, by the looks of it, you've still missed out some NaN and some categorical columns from the lists. What dataset are you using? There are far too many columns to list them by hand – Tasty213 Aug 21 '19 at 13:14
  • @OnurOzbek please see my updated question it runs correctly on my environment, if it does on yours please mark the answer – Tasty213 Aug 21 '19 at 14:08
  • I have a question about when you removed `NaN`s from the dataset. Do you mean that you've completely dropped those columns from the dataset altogether? – Onur-Andros Ozbek Aug 21 '19 at 16:57
  • Sorry, I was unclear: NaNs in numerical columns are filled with the mean of the column, and NaNs in text columns are filled with XX – Tasty213 Aug 21 '19 at 17:23
  • Ah okay. Let me implement it. If it all goes smooth, you'll get a checkmark from me :) Btw. If you think that this was a well asked question, could you give me an upvote? – Onur-Andros Ozbek Aug 21 '19 at 22:10
  • Sadly I don't have enough rep to upvote on Stack Overflow, only on Data Science – Tasty213 Aug 22 '19 at 09:56
  • Hey man. I've got 2 questions. First, I want to drop the column `Id`, which basically indexes every row from 0. I don't think I need that in my model. I just want to confirm with you that it won't affect my model. Second, my output for `print(train.columns[selected])` gives me a list of columns, all of which are `dtype='object'`. I don't understand why it hasn't used any of the float columns. – Onur-Andros Ozbek Aug 26 '19 at 10:41

Your code works for me.

import pandas as pd
import numpy as np

df = pd.DataFrame({'x1':[np.nan,2,3,4,5],'x2':[6,7,np.nan,9,10], 'x3':range(10,15)})

# Use a name other than `list` so the built-in type is not shadowed
cols = ['x1', 'x2']

for i in cols:
    df[i] = df[i].fillna('XX')
Maeaex1