Okay, I asked this: Functional chaining / composing filter functions of DataFrame in Python? and it was erroneously marked duplicate.
So we're trying again:
What I have is a bunch of data that I can load as a SQL table or a Pandas dataframe. What I'd like to do is offer a bunch of simple filter functions that can be composed (but I will not know the order of composition until run time). So I want to offer the user an interface to these functions that is as syntactically simple as possible.
Ideally, I'd like to offer the user a toolbox of functions (say, is_size, is_red, max_price, for_male, for_female, is_shirt, etc.) and then let them mix and match those however they'd like to get their result:
result = my_clothes.is_red().is_size('large').max_price(100).for_male()
which, of course, would return the same as
result = my_clothes.max_price(100).is_size('large').for_male().is_red(), etc.
Now, as I stated in the previous question, I can do this in Pandas using pipes:
def get_color(df, color):
    return df[df['color'] == color]

def is_shirt(df):
    return df[df['shirt'] == True]

(poll.pipe(is_shirt)
     .pipe(get_color, color='red')
)
That's a little ugly syntactically for the audience this library is intended for.
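Since the order of the filters is only known at run time, here is a minimal sketch of how the pipe approach could take a list of steps instead of a hard-coded chain. The sample data, the 'type' and 'price' columns, the max_price filter, and the apply_filters helper are all made up for illustration; only DataFrame.pipe and functools.reduce are real library calls.

```python
from functools import reduce

import pandas as pd


def get_color(df, color):
    return df[df['color'] == color]


def is_shirt(df):
    # Assumes a 'type' column, as in the wrapper class further down.
    return df[df['type'] == 'shirt']


def max_price(df, price):
    return df[df['price'] < price]


def apply_filters(df, filters):
    """Apply (function, kwargs) pairs in the order given, via DataFrame.pipe."""
    return reduce(lambda acc, step: acc.pipe(step[0], **step[1]), filters, df)


# Made-up sample data, only for illustration.
clothes = pd.DataFrame({
    'type': ['shirt', 'shirt', 'pants', 'shirt'],
    'color': ['red', 'blue', 'red', 'red'],
    'price': [20, 30, 40, 150],
})

# The list of steps can be assembled at run time, in any order.
steps = [
    (is_shirt, {}),
    (get_color, {'color': 'red'}),
    (max_price, {'price': 100}),
]
result = apply_filters(clothes, steps)
reordered = apply_filters(clothes, list(reversed(steps)))
```

Both calls keep the same rows, because each step only drops rows and never depends on what an earlier step did. It works, but the user still has to build the list of steps, which is not much prettier than the chained pipes.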
I also figured out a way to build a class around the dataframe. It has a "chain" member that each filter narrows down, and a "done" function that signals the chain is finished and returns the dataframe we've constructed:
import pandas as pd

class wrapper_for_dataframe():
    my_data = pd.DataFrame()  # the actual data
    chain = pd.DataFrame()    # working copy that the filters narrow down

    def is_shirt_chain(self):
        self.chain = self.chain[self.chain.type == 'shirt']
        return self

    def max_price_chain(self, price):
        self.chain = self.chain[self.chain.price < price]
        return self

    def done(self):
        temp = self.chain.copy()
        # reset self.chain to the full data so the next chain starts fresh
        self.chain = self.my_data.copy()
        return temp
So, with that, I can do things like:
result = wrapper_for_dataframe_instance.is_shirt_chain().max_price_chain(200).done()
(Note: the above notation may not be 100% correct; it's simplified from something else I built, but I can get that to work)
So this is closer, but it suffers from the usual problems of building a wrapper around a dataframe: it's a pain to do "normal" pandas stuff with the DF (seemingly you have to build a function for everything, although I think there's probably a way to pass normal Pandas functions through to the underlying dataframe).
There are a number of other reasons why this is bad. What happens if you have more than one chain at a time? Chaos? Is having an extra copy of the data a good idea? Probably not.
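To make the "pass normal Pandas functions through" idea concrete, here is a rough sketch of what I have in mind, not something I've tested at scale. The Clothes class and the sample data are hypothetical. It assumes that forwarding unknown attributes with __getattr__ is acceptable, and that each filter can return a new wrapper instead of mutating a shared chain, which would also remove the need for done().

```python
import pandas as pd


class Clothes:
    """Hypothetical wrapper: each filter returns a new wrapper, nothing is mutated."""

    def __init__(self, df):
        self._df = df

    def is_shirt(self):
        return Clothes(self._df[self._df['type'] == 'shirt'])

    def max_price(self, price):
        return Clothes(self._df[self._df['price'] < price])

    def __getattr__(self, name):
        # Only called when normal lookup fails, so anything that is not a
        # filter defined above is forwarded to the wrapped DataFrame.
        if name == '_df':
            # Guard against infinite recursion if _df has not been set yet.
            raise AttributeError(name)
        return getattr(self._df, name)


# Made-up sample data, only for illustration.
data = pd.DataFrame({
    'type': ['shirt', 'shirt', 'pants'],
    'price': [20, 150, 40],
})

# Two chains built from the same data do not interfere with each other.
cheap_shirts = Clothes(data).is_shirt().max_price(100)
cheap_anything = Clothes(data).max_price(50)

# .shape and .price are not defined on Clothes, so they reach the DataFrame.
print(cheap_shirts.shape)
print(cheap_anything.price.tolist())
```

The catch is that anything forwarded this way returns a plain pandas object, so the custom filters are no longer available after the first "normal" pandas call. Special methods are also not forwarded by __getattr__, so len(cheap_shirts) or cheap_shirts['price'] would still need explicit methods on the wrapper.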
So, is there another way of doing this? I think Django has this facility, but that's a little heavyweight.
Another thought was SQLAlchemy; I could shift the whole thing out of Pandas and into the SQL realm and build functions that make use of the or_ function and SQLAlchemy filtering (like this: Using OR in SQLAlchemy). But that means I have to learn SQLAlchemy (which I will do, if that's the best solution here).
Anyway, any ideas? Thanks.