Machine Learning feature-engine MeanEncoder gives error with Cancer dataset

Question

I'm working with the Wisconsin breast cancer dataset found here. Feature engineering is important in machine learning, so a teacher of mine recommended the MeanEncoder part of a library found here. The dataframe looks like the following:

I specifically changed the diagnosis feature/column to category because one of the errors suggested that might have been the issue, but apparently not, as it's still not solved.

I want to mean encode the target feature/column using the MeanEncoder found in the library linked above. Here's my function to attempt to do so:

def MeanEncoding(self):
   # Get the columns besides the target variable at the front, which is diagnosis, as recommended by teacher.
   cols = self.m_df.iloc[:, 1:].columns.to_list()

   # Save specifically the target variable too.
   target = self.m_df.iloc[:, 0]

   # Now get the object ready.
   encoder = MeanEncoder(variables=cols)

   print('---Fitting---')

   encoder.fit(self.m_df.drop('diagnosis', axis=1), target)

In this code:

1. m_df is just the dataframe, hence the "df".
2. I drop the diagnosis column/feature in the first argument of encoder.fit, since it's provided in the second argument of the same function. But it makes no difference, because I still get the error: "TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer"

Now with #2, I'm thinking, "No way, I have to transform the numeric features, which are 'radius_mean', 'texture_mean', etc., into category or object? That makes 0 sense". But I googled this error, of course, and it brought me to this SO thread. This individual has concerns similar to mine, except with a different function. The suggestion for him was "Just change the dtype of grade column to object before using imputer", so I changed the types to object as well with the following code:

for i in range(1, len(self.m_df.columns)):
   columnName = self.m_df.columns[i]
   self.m_df[columnName] = self.m_df[columnName].astype('object')

Doesn't make sense to me because it's converting the types of genuine numeric columns/features. I get this error which is KIND of expected:

pandas.core.base.DataError: No numeric types to aggregate

Now I'm thinking it just wants a few numeric types, so I slightly alter the code:

for i in range(1, len(self.m_df.columns) - 2):
   columnName = self.m_df.columns[i]
   self.m_df[columnName] = self.m_df[columnName].astype('object')

This leaves just the last 2 columns as float64, with all others as type object (besides the diagnosis column, which is category, but I doubt that matters). Now some numeric types ARE present. Yet I still get the error again:

TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer

I am clearly missing something but not sure what. No matter how I alter the types to satisfy the function, it's wrong.

score 1 · Answer 1 · answered Jan 31 '22 at 14:16

The MeanEncoder from Feature-engine, like all other Feature-engine encoders, by default works only on variables cast as object or category.

So the variables captured in the list cols in this line of code: cols = self.m_df.iloc[:, 1:].columns.to_list() should only contain categorical variables (object or category).

When you set up the encoder here: encoder = MeanEncoder(variables=cols), in variables, you indicate the variables to encode. If you pass cols, it means you want to encode all the variables within the cols list. So you need to ensure that all of them are of type category or object.

If you get the error "TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer", it means that some of the variables in cols are not of type object or category.
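One way to make sure cols only holds categorical variables is to build it from the dtypes rather than by position. The following is a minimal sketch using plain pandas; the toy dataframe and the "grade" column are hypothetical stand-ins, not the asker's actual data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the asker's m_df.
df = pd.DataFrame({
    "diagnosis": pd.Categorical(["M", "B", "B", "M"]),
    "radius_mean": [17.99, 13.54, 13.08, 20.57],
    "grade": pd.Categorical(["high", "low", "low", "high"]),
})

# Drop the target, then keep only object/category columns.
cols = (
    df.drop(columns="diagnosis")
      .select_dtypes(include=["object", "category"])
      .columns.to_list()
)
print(cols)
```

On a dataset whose features are all numeric, this list would come back empty, which is a hint that there is nothing for the encoder to work on under its default settings.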

If you want to encode numerical variables, there are 2 options: 1) recast the variables you want to encode as object. 2) set the parameter ignore_format=True as per the transformer's documentation. That should solve your problem.
