Build stand-alone but composable "atomic" filter functions for a SQL database/Pandas dataframe?

Question

Okay, I asked this: Functional chaining / composing filter functions of DataFrame in Python? and it was erroneously marked duplicate.

So we're trying again:

What I have is a bunch of data that I can load as a SQL table or a Pandas dataframe. What I'd like to do is offer a bunch of simple filter functions that can be composed (but I will not know the order of composition until run time). So I want to offer the user an as-syntactically-simple-as-possible interface to these functions.

Ideally, I'd like to offer the user a toolbox of functions (say, is_size, is_red, max_price, for_male, for_female, is_shirt, etc.) and then let them mix and match those however they'd like to get their result:

result = my_clothes.is_red().is_size('large').max_price(100).for_male()

which, of course, would return the same as

result = my_clothes.max_price(100).is_size('large').for_male().is_red(), etc.

Now, as I stated in the previous question, I can do this in Pandas using pipes:

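def get_color(df, color):
    # keep only the rows whose color matches
    return df[df['color'] == color]

def is_shirt(df):
    # keep only the rows flagged as shirts
    return df[df['shirt'] == True]

# poll is the dataframe holding the data
(poll.pipe(is_shirt)
    .pipe(get_color, color='red')
)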

That's a little ugly syntactically for the audience this library is intended for.

I also figured out a way to build a class around the dataframe. It has a "chain" member that each filter narrows down, and a "done" function that signals the end of the chain and returns the dataframe we've constructed:
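Since the order of composition is only known at run time, one way to drive the pipe approach is to keep the chosen filters in a list and fold them over the dataframe. This is a minimal sketch, not a finished interface: the apply_filters helper, the max_price filter, and the sample data are all hypothetical and not part of the code above.

```python
import functools

import pandas as pd


def get_color(df, color):
    return df[df['color'] == color]


def is_shirt(df):
    return df[df['shirt'] == True]


def max_price(df, price):
    # hypothetical extra filter, strict "<" like the other examples in this post
    return df[df['price'] < price]


def apply_filters(df, steps):
    # hypothetical helper: steps is a list of (function, keyword-arguments) pairs
    # that is only assembled at run time
    return functools.reduce(
        lambda acc, step: acc.pipe(step[0], **step[1]), steps, df
    )


# made-up sample data
poll = pd.DataFrame({
    'shirt': [True, True, False, True],
    'color': ['red', 'blue', 'red', 'red'],
    'price': [10, 20, 30, 200],
})

steps = [
    (is_shirt, {}),
    (get_color, {'color': 'red'}),
    (max_price, {'price': 100}),
]

forward = apply_filters(poll, steps)
backward = apply_filters(poll, list(reversed(steps)))
```

Because each filter only drops rows, forward and backward should hold the same rows whatever order the list is in. This doesn't make the syntax any prettier for the end user, but it does keep each filter a stand-alone function.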

import pandas as pd

class wrapper_for_dataframe():
    def __init__(self, df):
        self.my_data = df        # the actual data
        self.chain = df.copy()   # working copy that the filters narrow down

    def is_shirt_chain(self):
        self.chain = self.chain[self.chain.type == 'shirt']
        return self

    def max_price_chain(self, price):
        self.chain = self.chain[self.chain.price < price]
        return self

    def done(self):
        temp = self.chain.copy()
        # reset self.chain so the next chain starts from the full data
        self.chain = self.my_data.copy()
        return temp

So, with that, I can do things like:

result = wrapper_for_dataframe_instance.is_shirt_chain().max_price_chain(200).done()

(Note: the above notation may not be 100% correct; it's simplified from something else I built, but I can get that to work)

So this is closer, but it suffers from the usual problems of wrapping a dataframe: it's sort of a pain to do "normal" pandas stuff with the DF (seemingly you have to build a function for everything, although there's probably a way to pass normal pandas functions through to the underlying dataframe).

There are a number of other reasons why this is bad (what happens if you have more than 1 chain at a time? Chaos? Is having an "Extra" copy of the data a good idea? Probably not)

So, is there another way of doing this? I think Django has this facility, but that's a little heavyweight.

Another thought was SQLAlchemy; I could shift the whole thing out of Pandas and into the SQL realm and build functions that make use of the or_ function and SQLAlchemy filtering (like this: Using OR in SQLAlchemy). But that means I have to learn SQLAlchemy (which I will do, if that's the best solution here).
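To make the SQL idea concrete without depending on SQLAlchemy, here is a rough sketch that swaps in the standard-library sqlite3 module instead. Each filter returns a WHERE fragment plus its parameters, and the fragments are ANDed together in whatever order they arrive. The table name, column names, run_filters helper, and sample rows are all hypothetical.

```python
import sqlite3


def is_red():
    return ("color = ?", ["red"])


def is_size(size):
    return ("size = ?", [size])


def max_price(limit):
    # strict "<", matching the pandas examples in this post
    return ("price < ?", [limit])


def run_filters(conn, filters):
    # hypothetical helper: AND the fragments together; order does not matter
    clauses = [clause for clause, _ in filters]
    params = [p for _, values in filters for p in values]
    sql = "SELECT name FROM clothes"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


# made-up sample data in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clothes (name TEXT, color TEXT, size TEXT, price REAL)")
conn.executemany(
    "INSERT INTO clothes VALUES (?, ?, ?, ?)",
    [
        ("tee", "red", "large", 20),
        ("coat", "red", "large", 150),
        ("jeans", "blue", "medium", 60),
        ("scarf", "red", "small", 15),
    ],
)

one_order = run_filters(conn, [is_red(), max_price(100)])
other_order = run_filters(conn, [max_price(100), is_red()])
```

The values are passed as query parameters rather than pasted into the SQL string, so user-supplied sizes or prices can't change the query itself.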

Anyway, any ideas? Thanks.

It sounds like you are asking about subclassing a DataFrame. While this has its own issues, it's a much more straightforward approach than your wrapper class. http://pandas.pydata.org/pandas-docs/stable/internals.html#subclassing-pandas-data-structures — attitude_stool, Dec 20 '15 at 19:59
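The subclassing route suggested in the comment above might look roughly like the sketch below. It relies on overriding the _constructor property, which the pandas subclassing documentation describes as the way to make operations return the subclass instead of a plain DataFrame. The Clothes class, its filter methods, and the sample data are hypothetical.

```python
import pandas as pd


class Clothes(pd.DataFrame):
    @property
    def _constructor(self):
        # tell pandas to build results as Clothes, so chaining keeps working
        return Clothes

    def is_red(self):
        return self[self['color'] == 'red']

    def is_size(self, size):
        # bracket access, because DataFrame.size is already a pandas attribute
        return self[self['size'] == size]

    def max_price(self, price):
        # strict "<", matching the other examples in this post
        return self[self['price'] < price]


# made-up sample data
my_clothes = Clothes({
    'name': ['tee', 'coat', 'jeans', 'scarf'],
    'color': ['red', 'red', 'blue', 'red'],
    'size': ['large', 'large', 'medium', 'small'],
    'price': [20, 150, 60, 15],
})

a = my_clothes.is_red().is_size('large').max_price(100)
b = my_clothes.max_price(100).is_size('large').is_red()
```

Since a Clothes object is still a DataFrame, ordinary pandas methods such as sort_values remain available on the result without any forwarding code.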

score 0 · Answer 1 · edited May 23 '17 at 11:44

So, using hints from this: How to redirect all methods of a contained class in Python? I think I can do it:

import pandas as pd

def tocontainer(func):
    # Wrap a dataframe method so a dataframe result comes back as a Container
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, pd.DataFrame):
            result = Container(result)
        return result
    return wrapper

class Container(object):
    def __init__(self, df):
        self.contained = df

    def __str__(self):
        # __str__ has to return a string rather than display one
        return str(self.contained)

    def __getitem__(self, item):
        result = self.contained[item]
        if isinstance(result, type(self.contained)):
            result = Container(result)
        return result

    def __getattr__(self, item):
        # Forward anything Container doesn't define to the dataframe
        result = getattr(self.contained, item)
        if callable(result):
            result = tocontainer(result)
        return result

    def __repr__(self):
        # same as __str__: return the text instead of calling display()
        return repr(self.contained)

    def max_price(self, cost):
        return Container(self.contained[self.contained.price < cost])

    def is_shirt(self):
        return Container(self.contained[self.contained.is_shirt == True])

    def _repr_html_(self):
        return self.contained._repr_html_()

so I can do things like:

my_data = pd.read_csv('my_data.csv')
my_clothes = Container(my_data)
cheap_shirts = my_clothes.is_shirt().max_price(20)

which is exactly what I wanted. Note the necessary calls to wrap the contained dataframe back up into the container class for each simple filter function. This may be bad for memory reasons, but it's the best solution I can think of so far.
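As a rough illustration of how the forwarding lets custom filters and ordinary pandas methods sit in the same chain, here is a trimmed-down, self-contained variant of the container. The _rewrap helper name, the guard on "contained", and the sample data are assumptions added for this sketch.

```python
import pandas as pd


def _rewrap(func):
    # hypothetical helper: re-wrap dataframe results, pass everything else through
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        return Container(result) if isinstance(result, pd.DataFrame) else result
    return wrapper


class Container:
    def __init__(self, df):
        self.contained = df

    def __getattr__(self, item):
        # only called when normal lookup fails, so Container's own methods win
        if item == "contained":
            # guard against endless recursion if __init__ has not run yet
            raise AttributeError(item)
        result = getattr(self.contained, item)
        return _rewrap(result) if callable(result) else result

    def max_price(self, cost):
        return Container(self.contained[self.contained["price"] < cost])

    def is_shirt(self):
        return Container(self.contained[self.contained["is_shirt"]])


# made-up sample data
my_clothes = Container(pd.DataFrame({
    "is_shirt": [True, False, True, True],
    "price": [30, 5, 10, 80],
}))

# custom filters mixed with forwarded pandas methods
result = my_clothes.is_shirt().sort_values("price").head(2).max_price(50)
shape = my_clothes.shape  # non-callable attributes are forwarded unchanged
```

One caveat: Python looks up special methods such as __len__ on the class, not through __getattr__, so len(my_clothes) is not forwarded unless Container defines __len__ itself.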

I'm sure I'll run into some of the caveats mentioned in the above-linked SO answer, but this will work for now. I see many variations on this question (but not quite the same), so I hope this helps someone.

ADDED BONUS: It took me a while to figure out how to get the dataframes of a composed class to look nice in IPython, but the _repr_html_ function does the trick (note the single, not double, underscore).
