Groupby into list for non consecutive values

Question

I am trying to group by this dataset

    col1    col2
0   A   1
1   B   1
2   C   1
3   D   3
4   E   3
5   F   2
6   G   2
7   H   1
8   I   1
9   j   2
10  K   2

into this

1 : [A, B, C]
3: [D, E]
2: [ F; G]
1: [ H, I]
2: [ J,K]

so it has to capture the difference in appearances of the elements and not group all at once.

So far I was able to do the normal groupby, df.groupby("col2")["col1"].apply(list) but it isn't correct.

jezrael · Accepted Answer · 2021-01-28T09:09:49.367

4

You need distinguish consecutive values by compare shifted values foe not equal with cumulative sum, last remove second level of MultiIndex:

s = (df.groupby(["col2", df["col2"].ne(df["col2"].shift()).cumsum()])["col1"]
       .agg(list)
       .reset_index(level=1, drop=True))

edited Jan 28 '21 at 09:09

answered Jan 28 '21 at 08:54

jezrael

822,522
95
1,334
1,252

Any performance difference between `agg` and `apply`? – Sayandip Dutta Jan 28 '21 at 09:06
2

@SayandipDutta - Not easy question, I guess `agg` should be faster, because `apply` is more general. – jezrael Jan 28 '21 at 09:08
2

@SayandipDutta - Here should be better use `agg` for general solution if want change `["col1"]` to `["col1", "col3", 'colN']` – jezrael Jan 28 '21 at 09:09
@jezrael is there a way to keep the ordering similar to the one in the original data? – ombk Jan 28 '21 at 09:48
1

@ombk Yes, use `.groupby(["col2", df["col2"].ne(df["col2"].shift()).cumsum()], sort=False)["col1"]` instead `.groupby(["col2", df["col2"].ne(df["col2"].shift()).cumsum()])["col1"]` – jezrael Jan 28 '21 at 09:49
@jezrael: I jsut wanted to ask you if it's done only by using python `itertools groupby `will it be faster/efficient than using pandas functionality ? – Pygirl Jan 28 '21 at 10:01
1

@Pygirl - It should be faster, if large data maybe pandas wins, maybe the best test in real data – jezrael Jan 28 '21 at 10:04

score 1 · Answer 2 · answered Jan 28 '21 at 09:59

Since Jezrael already answered is using pandas. I would like to add non pandas method.

I know this is not an efficient method but for learning purpose I included.

Using itertools's groupby

from itertools import groupby
last_index = 0
for v, g in groupby(enumerate(df.col2), lambda k: k[1]):
    l = [*g]
    print(df.iloc[last_index]['col2'],':', df.iloc[last_index:l[-1][0]+1]['col1'].values)
    last_index += len(l)

1 : ['A' 'B' 'C']
3 : ['D' 'E']
2 : ['F' 'G']
1 : ['H' 'I']
2 : ['j' 'K']

It will maintain the order of their occurence. – Pygirl Jan 28 '21 at 10:02 — Pygirl, Jan 28 '21 at 10:02

Groupby into list for non consecutive values

2 Answers2