Pandas - group by column and transform the data to numpy array

Question

Given the following data frame, where group A has 4 samples, B has 3 samples and C has 1 sample:

  group   data_1   data_2
0     A        1        4
1     A        2        5
2     A        3        6
3     A        4        7
4     B        1        4
5     B        2        5
6     B        3        6
7     C        1        4

I would like to transform the data into a NumPy array where each row is a group with all its samples, zero-padded for groups that have fewer samples.

The result should look like this:

[
   [[1,4],[2,5],[3,6],[4,7]], # this is A group 4 samples
   [[1,4],[2,5],[3,6],[0,0]], # this is B group 3 samples
   [[1,4],[0,0],[0,0],[0,0]], # this is C group 1 sample
]

jezrael · Accepted Answer · 2018-10-03T07:24:00.890

First, the missing rows have to be added. The first solution does this with unstack and stack, using a counter Series created by cumcount.

The second solution uses reindex with a MultiIndex.

Finally, both solutions use a lambda function with groupby that converts each group to a NumPy array via values and then to a list:

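# counter of each row within its group: 0, 1, 2, 3, 0, 1, 2, 0
g = df.groupby('group').cumcount()
L = (df.set_index(['group', g])
       .unstack(fill_value=0)   # one column per position, 0 where a group has no row
       .stack()                 # back to one row per (group, position)
       .groupby(level=0)
       .apply(lambda x: x.values.tolist())
       .tolist())
print(L)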

[[[1, 4], [2, 5], [3, 6], [4, 7]], 
 [[1, 4], [2, 5], [3, 6], [0, 0]], 
 [[1, 4], [0, 0], [0, 0], [0, 0]]]
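
The question asks for a NumPy array, while the snippet above ends with nested lists. Because every group is padded to the same length, the list can be passed straight to np.array. A minimal sketch, assuming the example frame from the question (rebuilt here so the block runs on its own; the name arr is chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Assumption: this reproduces the frame shown in the question.
df = pd.DataFrame({
    'group': list('AAAABBBC'),
    'data_1': [1, 2, 3, 4, 1, 2, 3, 1],
    'data_2': [4, 5, 6, 7, 4, 5, 6, 4],
})

# Same steps as the first solution: pad with zeros, then collect per group.
g = df.groupby('group').cumcount()
L = (df.set_index(['group', g])
       .unstack(fill_value=0)
       .stack()
       .groupby(level=0)
       .apply(lambda x: x.values.tolist())
       .tolist())

# All inner lists have equal length, so this is a regular 3-D array.
arr = np.array(L)
print(arr.shape)
```

The axes of arr are (group, sample within group, data column), so arr[1] holds the padded samples of group B.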

Another solution:

import pandas as pd

g = df.groupby('group').cumcount()
# every (group, position) pair, including the ones missing from df
mux = pd.MultiIndex.from_product([df['group'].unique(), g.unique()])
L = (df.set_index(['group', g])
       .reindex(mux, fill_value=0)
       .groupby(level=0)   # selecting ['data_1','data_2'] here is not necessary
       .apply(lambda x: x.values.tolist())
       .tolist())
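
As an alternative that is not part of the original answer, the padded array can also be filled directly in NumPy: allocate a zero array of shape (groups, largest group size, data columns) and write each row at its (group code, position) slot. A sketch under the assumption that the frame matches the question's example; cols, codes, pos and out are names picked for illustration:

```python
import numpy as np
import pandas as pd

# Assumption: this reproduces the frame shown in the question.
df = pd.DataFrame({
    'group': list('AAAABBBC'),
    'data_1': [1, 2, 3, 4, 1, 2, 3, 1],
    'data_2': [4, 5, 6, 7, 4, 5, 6, 4],
})

cols = ['data_1', 'data_2']
values = df[cols].to_numpy()

# Integer code per row for its group (A -> 0, B -> 1, C -> 2).
codes, uniques = pd.factorize(df['group'], sort=True)
# Position of each row within its group.
pos = df.groupby('group').cumcount().to_numpy()

# Zero-initialised target; slots that no row writes to stay 0 (the padding).
out = np.zeros((len(uniques), pos.max() + 1, len(cols)), dtype=values.dtype)
out[codes, pos] = values
print(out.tolist())
```

This skips the intermediate nested lists, which may matter for larger frames, but it produces the padded array only, not the per-group lists.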

edited Oct 03 '18 at 07:24

answered Oct 03 '18 at 07:10

Thanks for the reply, but I get an error KeyError: "Columns not found: 'data_2', 'data_1'" – Shlomi Schwartz Oct 03 '18 at 07:15
@ShlomiSchwartz - What is `print (df.columns.tolist())` ? – jezrael Oct 03 '18 at 07:16
.stack().groupby(level=0)['data_1','data_2'] causes the error. Before that line the df looks like so:
      data_1          data_2
           0  1  2  3      0  1  2  3
group
A          1  2  3  4      4  5  6  7
B          1  2  3  0      4  5  6  0
C          1  0  0  0      4  0  0  0
– Shlomi Schwartz Oct 03 '18 at 07:19
@ShlomiSchwartz - Just realised that `['data_1','data_2']` is not necessary – jezrael Oct 03 '18 at 07:21
could you please have a look here: https://stackoverflow.com/questions/52735334/python-resample-dataset-to-have-balanced-classes ? – Shlomi Schwartz Oct 10 '18 at 08:27
@ShlomiSchwartz - I checked it, but unfortunately I have no experience with it :( So I upvoted to highlight your question. – jezrael Oct 10 '18 at 08:37
Hi, what should I change in the code if I don't want to fill values and would rather have a jagged nested list instead? – Saket Kumar Singh Feb 20 '19 at 11:15
@SaketKumarSingh - Try removing `.reindex(mux, fill_value=0)` – jezrael Feb 20 '19 at 11:16
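
For the jagged variant asked about in the comment above, another option is to skip the padding step altogether and collect each group's rows with a list comprehension. A sketch, assuming the example frame from the question; cols and jagged are illustrative names, and this is a different route from the one suggested in the reply:

```python
import pandas as pd

# Assumption: this reproduces the frame shown in the question.
df = pd.DataFrame({
    'group': list('AAAABBBC'),
    'data_1': [1, 2, 3, 4, 1, 2, 3, 1],
    'data_2': [4, 5, 6, 7, 4, 5, 6, 4],
})

cols = ['data_1', 'data_2']
# One inner list per group, each as long as the group itself (no zero rows).
jagged = [grp[cols].to_numpy().tolist() for _, grp in df.groupby('group')]
print(jagged)
```

Since the inner lists differ in length, the result stays a plain Python list rather than a regular NumPy array.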
