I am trying to write a sklearn-based feature extraction pipeline. My pipeline code can be split into a few parts:

  1. A parent class where all data preprocessing (if required) could happen
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureExtractor(BaseEstimator, TransformerMixin):
    """This is the parent class for all feature extractors."""
    def __init__(self, raw_data = {}):
        self.raw_data = raw_data

    def fit(self, X, y=None):
        return self
  2. A decorator that defines the execution order of feature extraction, to handle the case where one feature depends on another feature.
# A decorator to assign the order of feature extraction within feature extractor classes
def feature_order(order):
   def order_assignment(to_func):
       to_func.order = order
       return to_func
   return order_assignment
  3. Finally, one of the child classes, where all the feature extraction happens:
class ChilldFeatureExtractor1(FeatureExtractor):
   """This is one of the child feature extractor classes."""

   def __init__(self, raw_data = {}):
       super().__init__(raw_data)
       self.raw_data = raw_data

   @feature_order(1)
   def foo_plus_one(self):
       return self.raw_data['foo'] + 1

   # This feature extractor depends on value populated in previous feature extractor
   @feature_order(2)
   def foo_plus_one_plus_one(self):
       return self.raw_data['foo_plus_one'] + 1

   def transform(self):
       functions = sorted(
           #get a list of extractor functions with attribute order
           [
           getattr(self, field) for field in dir(self)
           if hasattr(getattr(self, field), "order")
           ],
           #sort the feature extractor functions by their order
           key = (lambda field: field.order)
           )

       for func in functions:
           feature_name = func.__name__
           feature_value = func()
           self.raw_data[feature_name] = feature_value

       return self.raw_data

Testing this code on a small input:

if __name__ == '__main__':
    raw_data = {'foo': 1, 'bar': 2}
    fe = ChilldFeatureExtractor1(raw_data)
    print(fe.transform())

This gives the following error:

Traceback (most recent call last):
  File "/Users/temporaryadmin/deleteme.py", line 55, in <module>
    print(fe.transform())
  File "/Users/temporaryadmin/deleteme.py", line 37, in transform
    [
  File "/Users/temporaryadmin/deleteme.py", line 39, in <listcomp>
    if hasattr(getattr(self, field), "order")
  File "/Users/temporaryadmin/opt/miniconda3/envs/voutopia/lib/python3.8/site-packages/sklearn/base.py", line 450, in _repr_html_
    raise AttributeError("_repr_html_ is only defined when the "
AttributeError: _repr_html_ is only defined when the 'display' configuration option is set to 'diagram'

However, when the base class does not inherit from the sklearn classes, i.e. class FeatureExtractor():, I get the proper output:

{'foo': 1, 'bar': 2, 'foo_plus_one': 2, 'foo_plus_one_plus_one': 3}

Any pointers on this?

abhiieor

2 Answers

Try this before running your code:

from sklearn import set_config
set_config(display='diagram')

That happens because the BaseEstimator class has a _repr_html_ property that depends on display being set to 'diagram' (source). I assume that the property gets evaluated at some point and throws the error.

Max Skoryk
    This does work, but it looks like a fix for the symptom. In a normal pipeline one does not need to set any config, right? For a complete answer it would be interesting to point out which part of the code changes this behavior. – abhiieor Feb 08 '22 at 12:28

The error traceback indicates where this goes wrong: self has an attribute _repr_html_ listed in its __dir__, but trying to access it with getattr raises that AttributeError, as shown in the source link from @maxskoryk's answer.
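
The same behavior can be reproduced without sklearn. The sketch below is only an illustration: Widget and its enabled flag are hypothetical stand-ins for the estimator and the 'display' option, not sklearn code. It shows that a property appears in dir() even when reading it raises, and that a getattr default absorbs the AttributeError.

```python
class Widget:
    """Hypothetical stand-in for an estimator with a conditionally available attribute."""

    enabled = False  # plays the role of the 'display' option (assumption)

    @property
    def _repr_html_(self):
        # The property exists on the class, but refuses to produce a value
        # unless the setting is switched on.
        if not self.enabled:
            raise AttributeError("_repr_html_ is only defined when enabled")
        return "<b>widget</b>"


w = Widget()

# The property is defined on the class, so dir() reports it...
listed = "_repr_html_" in dir(w)

# ...but reading it runs the property body, which raises.
try:
    getattr(w, "_repr_html_")
    raised = False
except AttributeError:
    raised = True

# With a default, getattr returns the default instead of propagating the AttributeError.
fallback = getattr(w, "_repr_html_", None)

print(listed, raised, fallback)  # True True None
```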

One fix is to give a default value in the getattr call:

   def transform(self):
       functions = sorted(
           #get a list of extractor functions with attribute order
           [
               getattr(self, field, None) for field in dir(self)
               if hasattr(getattr(self, field, None), "order")
           ],
           #sort the feature extractor functions by their order
           key = (lambda field: field.order),
       )
       ...

You could also just limit to attributes not starting with an underscore, or any other reasonable way to limit which attributes get checked.
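
A minimal sketch of the underscore-filtering variant follows. To keep it runnable without sklearn, Base is a hypothetical stand-in for BaseEstimator with a private property that raises on access, and Extractor is an illustrative name. With the real BaseEstimator the public names would also include methods such as get_params, which have no order attribute and are skipped the same way.

```python
def feature_order(order):
    """Attach an execution order to a feature method (same idea as the question's decorator)."""
    def order_assignment(to_func):
        to_func.order = order
        return to_func
    return order_assignment


class Base:
    """Hypothetical stand-in for BaseEstimator: one private property that raises on access."""

    @property
    def _repr_html_(self):
        raise AttributeError("not available")


class Extractor(Base):
    def __init__(self, raw_data):
        self.raw_data = raw_data

    @feature_order(1)
    def foo_plus_one(self):
        return self.raw_data["foo"] + 1

    @feature_order(2)
    def foo_plus_one_plus_one(self):
        return self.raw_data["foo_plus_one"] + 1

    def transform(self):
        # Only look at public names, so private and dunder attributes
        # (including the raising property) are never accessed.
        public = [name for name in dir(self) if not name.startswith("_")]
        candidates = [getattr(self, name) for name in public]
        functions = sorted(
            (attr for attr in candidates if hasattr(attr, "order")),
            key=lambda func: func.order,
        )
        for func in functions:
            self.raw_data[func.__name__] = func()
        return self.raw_data


result = Extractor({"foo": 1, "bar": 2}).transform()
print(result)
```

Filtering by name avoids touching attributes the class does not own, whereas the getattr default only hides AttributeError and would still let any other exception from a property propagate.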

Ben Reiniger