Note: The thread below prompted a pull request which was eventually merged into v1.1.0. This issue is now resolved.

I'm using a subclassed DataFrame so that I can have more convenient access to some transformation methods and metadata particular to my use-case. Most of the DataFrame operations work as expected, in that they return an instance of the subclass, rather than an instance of pandas.DataFrame. However, aggregation operations like DataFrame.groupby and DataFrame.resample seem to mess this up.

Is this a bug, or have I missed something when defining my subclass?

Below is a minimal example, tested on pandas 0.25.1:

import pandas as pd

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

dates = pd.date_range('2019', freq='D', periods=365)
my_df = MyDataFrame(range(len(dates)), index=dates)

assert isinstance(my_df, MyDataFrame)
# Success!

assert isinstance(my_df.diff(), MyDataFrame)
# Success!

assert isinstance(my_df.sample(10), MyDataFrame)
# Success!

assert isinstance(my_df[:10], MyDataFrame)
# Success!

assert isinstance(my_df.resample("D").sum(), MyDataFrame)
# AssertionError

assert isinstance(my_df.groupby(my_df.index.month).sum(), MyDataFrame)
# AssertionError
grge
  • From the docs, GroupBy returns a DataFrameGroupBy or SeriesGroupBy object. Resample returns a Resampler object – mgrollins Sep 06 '19 at 23:17
  • Right, so, one option would be to subclass DataFrameGroupBy as well, and then somehow tell pandas to use the correct constructor when calling groupby on instances of MyDataFrame? Perhaps I could override groupby in my subclass, but then I would need to do the same for other aggregation methods (e.g. rolling, resample, expanding). It just seems like there might be an officially "intended" solution to this problem. – grge Sep 06 '19 at 23:35

1 Answer

I don't know if it's a "bug" per se, but I agree that it should be changed regardless. If you take a look at some of the source code for groupby-type objects, you'll see a lot of hardcoded return DataFrame(...) and return Series(...).

As you rightfully pointed out, pandas objects have three constructor properties that are used to build new versions of themselves:

  • _constructor to create objects of the same type
  • _constructor_sliced to create a series-like object from a dataframe-like object
  • _constructor_expanddim to create a dataframe-like object from a series-like object
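To make the three hooks concrete, here is a minimal sketch of a DataFrame/Series subclass pair that defines them, using the property names from the pandas extension docs (`_constructor`, `_constructor_sliced`, `_constructor_expanddim`). The class names `MySeries` and `MyDataFrame` and the sample data are illustrative only.

```python
import pandas as pd


class MySeries(pd.Series):
    @property
    def _constructor(self):
        # Same-dimension results (e.g. arithmetic on a series) stay MySeries.
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Series -> DataFrame conversions produce MyDataFrame.
        return MyDataFrame


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Same-dimension results (e.g. diff, slicing rows) stay MyDataFrame.
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        # DataFrame -> Series conversions (e.g. column selection) produce MySeries.
        return MySeries


df = MyDataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
col = df["a"]           # built through _constructor_sliced
frame = col.to_frame()  # built through _constructor_expanddim
```

With both classes wired up this way, selecting a column and expanding it back to a frame should round-trip through the subclasses rather than falling back to the plain pandas types.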

These can be used instead of the hardcoded types in core/groupby/generic.py, which is easy to do since the groupby objects store the starting NDFrame as the attribute obj.
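As a rough illustration of the idea (not the actual patch), the snippet below shows that a groupby object keeps a reference to the frame it was created from in `obj`, so the subclass constructor is reachable from there. Rewrapping the aggregated result by hand, as done here, is only a stand-in for what the library code could do internally; the column name `value` is made up for the example.

```python
import pandas as pd


class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame


dates = pd.date_range("2019", freq="D", periods=365)
my_df = MyDataFrame({"value": range(len(dates))}, index=dates)

gb = my_df.groupby(my_df.index.month)

# The groupby object remembers the frame it was built from,
# so the subclass (and its constructor) is still available here.
source_type = type(gb.obj)

# Rebuild the aggregated result with the source object's constructor
# instead of a hardcoded DataFrame(...).
rebuilt = gb.obj._constructor(gb.sum())
```

On pandas versions where groupby already preserves the subclass, the rewrap is redundant but harmless.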

A branch with these changes implemented can be found on my fork here: https://github.com/alkasm/pandas/tree/groupby-preserve-subclass

alkasm
  • Thanks. It certainly does appear that this functionality is not yet supported. Further evidence for this is that GeoPandas, which makes quite advanced use of DataFrame subclassing, has the same behaviour. Re. the sentence "pandas couldn't possibly know how to make a compatible Series", I don't think this is insurmountable. If the subclass object or class was passed into the DataFrameGroupBy, then it could use the `_constructor_sliced` (see [here](https://pandas.pydata.org/pandas-docs/stable/development/extending.html#override-constructor-properties)) method to create a compatible Series type. – grge Sep 07 '19 at 02:26
  • @grge wow, that is a nice find! If that's the case, a solution that could potentially make its way as a PR is just to store the type that the groupby operation was completed on, and use those methods instead of hardcoding it. I would suggest opening up an issue on [pandas-dev](https://github.com/pandas-dev/pandas). – alkasm Sep 07 '19 at 02:44
  • I raised an issue here: https://github.com/pandas-dev/pandas/issues/28330 – grge Sep 07 '19 at 03:59
  • I'm struggling with this issue now and see that the PR was apparently never merged into master. @alkasm, would you recommend monkey patching this commit into my project for the time being: https://github.com/alkasm/pandas/commit/63936681803da0138489b5b537a926bb98b7b2c1 – pandichef Apr 15 '20 at 03:58
  • @pandichef annoyingly after I worked on the commit I was asked to shuffle my tests around, at which point my commits became hugely incongruent with a big refactor of group-by things, and afterwards I could not figure out how to easily fix the behavior in all cases (see here: https://github.com/pandas-dev/pandas/pull/28573#issuecomment-553862758). You should just be able to cast the dataframe back to your subclassed type, which seems easier to maintain than a stale attempted contribution fork tbh. – alkasm Apr 15 '20 at 07:42
  • @alkasm The problem with "casting the dataframe back to the subclassed type" is that it doesn't preserve dynamically created object attributes e.g., an attribute that contains column formats for df.to_html. Have you ever come across this? If so, how would you solve this problem? – pandichef Apr 15 '20 at 16:15
  • @pandichef Hm, that does add complexity. Of course you can still cast and add the attributes again but that's definitely brittle. This isn't a good location for this discussion, though I'm happy to chat more---send me an email (check my Github)! I can possibly try to pick up this PR again. – alkasm Apr 15 '20 at 19:29
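
For readers on a pandas version without the fix, the "cast back and re-add the attributes" workaround discussed in the comments might look roughly like the sketch below. The helper `recast` and the attribute `column_formats` are hypothetical names chosen for illustration; the only pandas feature relied on is the documented `_metadata` class attribute, which lists custom attributes a subclass wants to carry along.

```python
import pandas as pd


class MyDataFrame(pd.DataFrame):
    # Names of custom attributes this subclass carries (hypothetical example).
    _metadata = ["column_formats"]
    column_formats = None  # class-level default so the attribute always exists

    @property
    def _constructor(self):
        return MyDataFrame


def recast(result, template):
    """Wrap `result` in the template's class and copy its custom attributes.

    Hypothetical helper, not part of pandas.
    """
    out = type(template)(result)
    for name in template._metadata:
        setattr(out, name, getattr(template, name, None))
    return out


dates = pd.date_range("2019", freq="D", periods=365)
my_df = MyDataFrame({"value": range(len(dates))}, index=dates)
my_df.column_formats = {"value": "{:.1f}"}

# groupby may hand back a plain DataFrame on older pandas; recast restores
# both the subclass and the attributes listed in _metadata.
monthly = recast(my_df.groupby(my_df.index.month).sum(), my_df)
```

This keeps the brittle part in one place: any new custom attribute only needs to be added to `_metadata` for `recast` to pick it up.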