1

My input is a pandas DataFrame :

   item foo_x foo_y bar_x bar_y
0     1     A     B     C     D
1     2     D     E     F     G
2     3     H     I     J     K
3     4     L     M     N     O

df = pd.DataFrame({'item': [1, 2, 3, 4],
 'foo_x': ['A', 'D', 'H', 'L'],
 'foo_y': ['B', 'E', 'I', 'M'],
 'bar_x': ['C', 'F', 'J', 'N'],
 'bar_y': ['D', 'G', 'K', 'O']})

I'm not asking too much to the groupby method, I only expect this standard aggregation :

   item       x       y
0     1  [A, C]  [B, D]
1     2  [D, F]  [E, G]
2     3  [H, J]  [I, K]
3     4  [L, N]  [M, O]

But my code below gives a nonsense error :

df_output = (
    df.rename(lambda x: x.split("_")[-1], axis=1)
        .groupby(level=0, axis=1).agg(list)
)

ValueError: Length of values (2) does not match length of index (4)

To be honest, this is absolutely counterintuitive based on how we're used to apply groupby(..., axis=0).

Can you please explain the logic behind ?

2 Answers2

3

The issue is that iterating over a DataFrame yields the column names:

list(pd.DataFrame({'A': [1, 2], 'B': [3, 4]}))
# ['A', 'B']

Using a small print hack to see what's going on in our groupby:

(df.rename(lambda x: x.split("_")[-1], axis=1)
   .groupby(level=0, axis=1).agg(lambda x: print(list(x)))
)

Printed output:

['item']
['x', 'x']
['y', 'y']

To avoid that, you need to convert to numpy:

df_output = (
    df.rename(lambda x: x.split("_")[-1], axis=1)
      .groupby(level=0, axis=1).agg(lambda x: x.to_numpy().tolist())
)

Output:

  item       x       y
0  [1]  [A, C]  [B, D]
1  [2]  [D, F]  [E, G]
2  [3]  [H, J]  [I, K]
3  [4]  [L, N]  [M, O]

excluding columns from the groupby

Simply set them temporarily as index:

cols = ['item']

df_output = (df
   .set_index(cols, append=True)
   .rename(lambda x: x.split("_")[-1], axis=1)
   .groupby(level=0, axis=1).agg(lambda x: x.to_numpy().tolist())
   .reset_index(cols)
)

Output:

   item       x       y
0     1  [A, C]  [B, D]
1     2  [D, F]  [E, G]
2     3  [H, J]  [I, K]
3     4  [L, N]  [M, O]
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Thank you so much, now the error message makes more sense but I still wonder why when grouping on `axis=0`, we don't need to use `to.numpy()` as an intermediate. For example, if there was duplicated items and the `item` as an index, we'll only use `df.groupby(level=0, axis=0).agg(list)`. –  Aug 01 '23 at 08:49
  • 1
    @rendezvous I know, this is counter-intuitive, but the way it work now… I could probably be changed by design, but that's something that should be discussed in the pandas development tracker ;) – mozway Aug 01 '23 at 08:51
  • With your explanations, I'll keep trying until I understand the difference between the 2, thank you again. –  Aug 01 '23 at 08:56
  • 1
    I knew I [arleady had been asked this question](https://stackoverflow.com/a/76383255/16343464) – mozway Aug 01 '23 at 08:56
  • Nit: In OPs example,`item` is like `1`, `2`, `...`, but in this answer, it is `[1]`, `[2]`, `...`. – Sash Sinha Aug 01 '23 at 08:56
  • 1
    @mozway, now Ynjxsjmh's question has its own reference. @SashSinha, I can say that in this particular situation, for me, understanding the `groupby` behaviour matters more than the exact expected output. –  Aug 01 '23 at 08:59
  • 2
    @SashSinha if this is an issue it's easy to handle, see update ;) – mozway Aug 01 '23 at 09:01
1

Can you try the following:

df['x'] = df[['foo_x', 'bar_x']].values.tolist()
df['y'] = df[['foo_y', 'bar_y']].values.tolist()
Jeril
  • 7,858
  • 3
  • 52
  • 69