Given the following data frame, in which group A has 4 samples, B has 3 samples and C has 1 sample:

  group   data_1   data_2
0     A        1        4
1     A        2        5
2     A        3        6
3     A        4        7
4     B        1        4
5     B        2        5
6     B        3        6
7     C        1        4

I would like to transform the data into a NumPy array in which each row is a group containing all of its samples, with zero padding for groups that have fewer samples.

The result should be an array like this:

[
   [[1,4],[2,5],[3,6],[4,7]], # this is A group 4 samples
   [[1,4],[2,5],[3,6],[0,0]], # this is B group 3 samples
   [[1,4],[0,0],[0,0],[0,0]], # this is C group 1 sample
]
Shlomi Schwartz

1 Answer

First, the missing values need to be added. The first solution does this with unstack and stack, using a counter Series created by cumcount.

The second solution uses reindex with a MultiIndex instead.

Finally, both solutions apply a lambda function to each group with groupby, converting the group to a NumPy array with values and then to nested lists:

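# position of each row within its group: 0, 1, 2, ...
g = df.groupby('group').cumcount()
L = (df.set_index(['group',g])
       .unstack(fill_value=0)    # one column per position, missing ones filled with 0
       .stack().groupby(level=0) # back to long format, now padded
       .apply(lambda x: x.values.tolist())
       .tolist())
print(L)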

[[[1, 4], [2, 5], [3, 6], [4, 7]], 
 [[1, 4], [2, 5], [3, 6], [0, 0]], 
 [[1, 4], [0, 0], [0, 0], [0, 0]]]
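Since the question asks for a NumPy array rather than nested lists, here is a minimal sketch of a variation on the first solution. It stops after unstack and reshapes the padded wide table directly, assuming that unstack orders the columns as all positions for data_1 followed by all positions for data_2. The names wide and arr are only illustrative, and the data frame is rebuilt from the question so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'group': list('AAAABBBC'),
    'data_1': [1, 2, 3, 4, 1, 2, 3, 1],
    'data_2': [4, 5, 6, 7, 4, 5, 6, 4],
})

# Position of each row within its group
g = df.groupby('group').cumcount()

# One row per group, one column per (data column, position) pair, padded with 0
wide = df.set_index(['group', g]).unstack(fill_value=0)

# (groups, data columns, positions) -> (groups, positions, data columns)
n_cols = 2  # data_1 and data_2
arr = wide.to_numpy().reshape(len(wide), n_cols, -1).transpose(0, 2, 1)

print(arr.shape)  # (3, 4, 2)
```

If the list result L is already available, np.array(L) should give the same array, because every group has been padded to the same length.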

Another solution:

g = df.groupby('group').cumcount()
# every (group, position) combination, including the missing ones
mux = pd.MultiIndex.from_product([df['group'].unique(), g.unique()])
L = (df.set_index(['group',g])
       .reindex(mux, fill_value=0)
       .groupby(level=0)[['data_1','data_2']]  # a list of columns needs double brackets
       .apply(lambda x: x.values.tolist())
       .tolist()
)
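As a possible alternative to both solutions above, the padded array can also be filled directly in NumPy without reshaping the data frame. This is only a sketch, not part of the original solutions: it assumes that the groups should appear in order of first occurrence, and the names codes, labels, pos, values and out are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'group': list('AAAABBBC'),
    'data_1': [1, 2, 3, 4, 1, 2, 3, 1],
    'data_2': [4, 5, 6, 7, 4, 5, 6, 4],
})

# Integer code for each row's group, in order of first appearance
codes, labels = pd.factorize(df['group'])

# Position of each row within its group
pos = df.groupby('group').cumcount().to_numpy()

values = df[['data_1', 'data_2']].to_numpy()

# Preallocate zeros, then write each row into its (group, position) slot
out = np.zeros((len(labels), pos.max() + 1, values.shape[1]), dtype=values.dtype)
out[codes, pos] = values

print(out.shape)  # (3, 4, 2)
```

Here labels holds the group names in the same order as the first axis of out, so it can be used to map rows of the array back to groups.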
jezrael