
We can think of applying two types of functions to a Pandas Series: transformations and aggregations. The documentation makes this distinction: transformations map individual values in the Series, while aggregations summarize the entire Series (e.g. mean).

It is clear how to apply transformations using apply, but I have not been successful in implementing a custom aggregation. Note that groupby is not involved, and aggregation does not require a groupby.

I am working with the following case: I have a Series in which each row is a list of strings. One way I could aggregate this data is to count up the number of appearances of each string, and return the 5 most common terms.

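def top_five_strings(series):
    # Tally how often each string appears across all rows.
    counter = {}
    for row in series:
        for s in row:
            if s in counter:
                counter[s] += 1
            else:
                counter[s] = 1

    # Sort by count, highest first, and keep the five most common.
    return sorted(counter.items(), key=lambda x: x[1], reverse=True)[:5]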
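
For comparison, the same count can be sketched with the standard library's collections.Counter. This is only an illustration: the name top_five_strings_counter and the sample rows are made up here, and plain lists stand in for the Series (iterating over a Series of lists yields the same rows).

```python
from collections import Counter
from itertools import chain


def top_five_strings_counter(rows):
    """Return up to five (string, count) pairs, most common first.

    `rows` is any iterable of lists of strings (hypothetical helper).
    """
    # Flatten all rows into one stream of strings and tally them.
    counts = Counter(chain.from_iterable(rows))
    # most_common(5) sorts by count, highest first, and truncates to five.
    return counts.most_common(5)


# Made-up sample data standing in for the Series.
rows = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
result = top_five_strings_counter(rows)
print(result)
```

Since the function only iterates over its argument, calling it directly on the Series should behave the same way as calling it on this list of lists.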

If I call this function as top_five_strings(series), it works fine, just as if I had called np.mean(series) on a numeric Series. The difference is that I can also call series.agg(np.mean) and get the same result, whereas series.agg(top_five_strings) instead gives me the top five letters in each row of the Series (which makes sense if a single row is passed as the argument to the function).
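
The letter counts can be reproduced without pandas at all, which suggests the function is being handed one row at a time. The sketch below uses a plain list of lists in place of the Series; the names top_five, series_like, whole and per_row are made up for the illustration.

```python
from collections import Counter


def top_five(rows):
    """Same nested-loop shape as top_five_strings, using Counter."""
    counts = Counter()
    for row in rows:
        for item in row:
            counts[item] += 1
    return counts.most_common(5)


# Made-up data standing in for the Series of lists.
series_like = [["red", "blue"], ["red", "green"]]

# Whole collection as the argument: the inner loop sees strings.
whole = top_five(series_like)

# One row as the argument: each "row" is now a single string,
# so the inner loop walks over its characters instead.
per_row = [top_five(row) for row in series_like]

print(whole)
print(per_row)
```

When the argument is a single row, the outer loop walks over the strings and the inner loop over their characters, so the result is a count of letters rather than of strings.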

I think the critical difference is that np.mean is a NumPy ufunc, but I haven't been able to work out how the _aggregate helper function works in the Pandas source.

I'm left with 2 questions:

1) Can I implement this by making my Python function a ufunc (and if so, how)?

2) Is this a stupid thing to do? I haven't found anyone else out there trying to do something like this. It seems to me like it would be quite nice, however, to be able to implement custom aggregations as well as custom transformations within the Pandas framework (e.g. I get a Series as a result as one might with df.describe).

Dylan B