
I have a sample snippet that works as expected:

import pandas as pd

df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)

The result is:

  label  wave  y     new
0     a     1  0    (1,)
1     b     2  0  (2, 3)
2     b     3  0  (2, 3)
3     c     4  0    (4,)

It works analogously if, instead of tuple, I give set, frozenset, or dict to transform, but if I give list I get a completely unexpected result:

df['new'] = df.groupby(['label'])[['wave']].transform(list)

  label  wave  y  new
0     a     1  0    1
1     b     2  0    2
2     b     3  0    3
3     c     4  0    4

There is a workaround to get expected result:

df['new'] = df.groupby(['label'])[['wave']].transform(tuple)['wave'].apply(list)

  label  wave  y     new
0     a     1  0     [1]
1     b     2  0  [2, 3]
2     b     3  0  [2, 3]
3     c     4  0     [4]

I thought it might be about mutability/immutability (list vs. tuple), but set/frozenset behave consistently with tuple, so that can't be the explanation.
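
For example, with set each row again receives the whole per-group collection:

df['new'] = df.groupby(['label'])[['wave']].transform(set)

  label  wave  y     new
0     a     1  0     {1}
1     b     2  0  {2, 3}
2     b     3  0  {2, 3}
3     c     4  0     {4}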

The question is: why does it work this way?

Quant Christo

5 Answers


I've come across a similar issue before. I think the underlying issue is that when the number of elements in the list matches the number of records in the group, pandas tries to unpack the list so that each element of the list maps to a record in the group.

For example, this will cause the list to unpack, because the length of the list matches the length of each group:

df.groupby(['label'])[['wave']].transform(lambda x: list(x))
    wave
0   1
1   2
2   3
3   4

However, if the length of the list is not the same as that of each group, you get the desired behaviour:

df.groupby(['label'])[['wave']].transform(lambda x: list(x)+[0])

    wave
0   [1, 0]
1   [2, 3, 0]
2   [2, 3, 0]
3   [4, 0]

I think this is a side effect of the list unpacking functionality.
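
The same length-based unpacking shows up outside groupby too, e.g. in a plain positional assignment (my sketch; that transform hits this exact path internally is an assumption):

import pandas as pd

s = pd.Series(index=[0, 1], dtype=object)
s.loc[[0, 1]] = [10, 20]  # list length matches the selection -> unpacked element-wise
print(s.tolist())         # prints [10, 20], not [[10, 20], [10, 20]]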

Allen Qin
  • Nice observation! Looks like this is an issue of how they internally represent intermediate results. But I think this should still be regarded as an error. – jottbe Sep 01 '19 at 08:01
  • For me this behavior is confusing. If there is no easy remedy for this, then support for list in transform should be dropped, or at least some warning should be added. – Quant Christo Sep 01 '19 at 08:22
  • @QuantChristo, I agree it's a very confusing behaviour and the similar issue I came across before took me a while to figure out why. Maybe you can file it as a bug. – Allen Qin Sep 01 '19 at 08:33
  • I've created an issue: https://github.com/pandas-dev/pandas/issues/28246 – Quant Christo Sep 01 '19 at 08:53
  • Hi, I'm not that familiar with Python. How does `[['wave']]` differ from `['wave']`? – Shinjo Sep 02 '19 at 04:08
  • @Shinjo It is pandas-related: type(df[['wave']]) is DataFrame, type(df['wave']) is Series. – Quant Christo Sep 02 '19 at 11:14
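
To make the distinction from the last two comments concrete (my own snippet, not from the thread):

import pandas as pd

df2 = pd.DataFrame({'wave': [1, 2]})
print(type(df2['wave']))    # <class 'pandas.core.series.Series'>
print(type(df2[['wave']]))  # <class 'pandas.core.frame.DataFrame'>
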

I think that is a bug in pandas. Can you open a ticket on their GitHub page, please?

At first I thought it might be because list is just not handled correctly as an argument to .transform, but if I do:

def create_list(obj):
    print(type(obj))
    return obj.to_list()

df.groupby(['label'])[['wave']].transform(create_list)

I get the same unexpected result. If, however, the agg method is used, it works directly:

df.groupby(['label'])['wave'].agg(list)
Out[179]: 
label
a       [1]
b    [2, 3]
c       [4]
Name: wave, dtype: object

I can't imagine that this is intended behavior.

Btw, I also find the difference in behavior suspicious that shows up if you apply tuple to a grouped Series versus a grouped DataFrame. E.g. if transform is applied to a Series instead of a DataFrame, the result is not a series containing tuples, but a series containing ints (remember that for [['wave']], which creates a one-column DataFrame, transform(tuple) indeed returned tuples):

df.groupby(['label'])['wave'].transform(tuple)
Out[177]: 
0    1
1    2
2    3
3    4
Name: wave, dtype: int64

If I do that again with agg instead of transform, it works for both ['wave'] and [['wave']].

I was using version 0.25.0 on an Ubuntu x86_64 system for my tests.

jottbe
  • I've created an issue: https://github.com/pandas-dev/pandas/issues/28246 The thing regarding ['wave'] vs [['wave']] is your finding and it is a separate issue, so it'd be better if you create an issue on GitHub just for that. Thanks a lot! – Quant Christo Sep 01 '19 at 08:57
  • Couldn't you just add it to your ticket? I would have to go through the whole process to set up a test case for that. – jottbe Sep 01 '19 at 13:49

Since DataFrames are mainly designed to handle 2D data, storing list-like objects instead of scalar values can stumble on caveats such as this one.

pd.DataFrame.transform is implemented on top of .agg:

# pandas/core/generic.py
@Appender(_shared_docs["transform"] % dict(axis="", **_shared_doc_kwargs))
def transform(self, func, *args, **kwargs):
    result = self.agg(func, *args, **kwargs)
    if is_scalar(result) or len(result) != len(self):
        raise ValueError("transforms cannot produce " "aggregated results")

    return result

However, transform always returns a DataFrame that must have the same length as self, which is essentially the input.
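
That length check is easy to trigger directly; a minimal sketch (mine, and note the exact message differs across pandas versions):

import pandas as pd

tmp = pd.DataFrame({'wave': [1, 2, 3, 4]})
try:
    tmp.transform(sum)  # sum aggregates each column to a scalar, so the check fires
except ValueError as e:
    print(e)  # e.g. "transforms cannot produce aggregated results"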

When you do an .agg function on the DataFrame, it works fine:

df.groupby('label')['wave'].agg(list)
label
a       [1]
b    [2, 3]
c       [4]
Name: wave, dtype: object

The problem gets introduced when transform tries to return a Series with the same length.

In the process of transforming a groupby element (which is a slice of self) and then concatenating it again, lists get unpacked to the same length as the index, as @Allen mentioned.

However, when the lengths don't align, the lists don't get unpacked:

df.groupby(['label'])[['wave']].transform(lambda x: list(x) + [1])
    wave
0   [1, 1]
1   [2, 3, 1]
2   [2, 3, 1]
3   [4, 1]

A workaround for this problem is to avoid transform altogether:

df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df = df.merge(df.groupby('label')['wave'].agg(list).rename('new'), on='label')
df
    label   wave    y   new
0   a         1     0   [1]
1   b         2     0   [2, 3]
2   b         3     0   [2, 3]
3   c         4     0   [4]
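
A similar sketch (my variant, not part of the answer above) broadcasts the aggregated lists back with Series.map instead of merge, which keeps the original index:

df['new'] = df['label'].map(df.groupby('label')['wave'].agg(list))
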
iDrwish
  • Yeah, I know this workaround. Until this issue I thought that transform = groupby.apply + merge, just syntactic sugar. – Quant Christo Sep 01 '19 at 08:31

Another interesting workaround, which works for strings, is:

df = df.applymap(str) # Make them all strings... would be best to use on non-numeric data.
df.groupby(['label'])['wave'].transform(' '.join).str.split()

Output:

0       [1]
1    [2, 3]
2    [2, 3]
3       [4]
Name: wave, dtype: object
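
Note that the split values come back as strings. If numeric lists are needed, one extra step (my addition, not part of the original answer) converts them back:

df.groupby(['label'])['wave'].transform(' '.join).str.split().apply(lambda lst: [int(v) for v in lst])
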
BeRT2me

The suggested answers no longer work as of pandas 1.2.4. Here is a workaround:

df.groupby(['label'])[['wave']].transform(lambda x: [list(x) + [1]]*len(x))

The idea behind it is the same as explained in the other answers (e.g. @Allen's answer). The solution here is to wrap the list in another list and repeat it as many times as the group length, so that when pandas' transform unpacks the outer list, each row gets the inner list.

Output:

    wave
0   [1, 1]
1   [2, 3, 1]
2   [2, 3, 1]
3   [4, 1]
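
If the extra element is unwanted (see the comment below), a variant of the same trick (my sketch, relying on the same unpacking behaviour) pads with a placeholder and strips it afterwards:

df['new'] = df.groupby(['label'])[['wave']].transform(lambda x: [list(x) + [None]]*len(x))['wave'].apply(lambda lst: lst[:-1])
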
Ehsan
  • Isn't this misleading, as you are literally adding an item to the list? – theStud54 Jun 05 '22 at 13:38
  • @theStud54 I am not sure what the 'misleading' part is. Please refer to Allen Qin's answer to see the underlying reasoning and why you need to add an item to the list. Depending on the application, this may or may not be useful for you. – Ehsan Jun 05 '22 at 23:05