R group_by() + rleid() equivalent in Python

Question

I've got a following data frame in Python:

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                         'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]), 
                                                np.repeat([1, 2, 3], [2, 2, 2])]),
                         'obj': list('AB' * 6),
                         'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})

First, within each group defined by object, I'd like to assign id to unique run of measurement_id and var columns. If any value of those columns changes, it starts new run that should be assigned with new id. So the

df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 3]

Then, for each group defined by rleid_output I'd like to check how many minutes (min column) the run lasted giving me expected_output column:

df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 3, 2, 3, 1, 3]

If it was R, I'd proceed as follows:

df <- data.frame(measurement_id = rep(1:2, each = 6),
           min = rep(rep(1:3, each = 2), 2),
           object = rep(LETTERS[1:2], 6),
           var = c(1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1))
df %>% 
  group_by(object) %>% 
  mutate(rleid = data.table::rleid(measurement_id, var)) %>% 
  group_by(object, rleid) %>% 
  mutate(expected_output = last(min) - first(min) + 1)

So the main thing I need is R data.table::rleid equivalent that would work with Python pd.DataFrame.groupby clause. Any ideas how to solve this?

@Edit: new, updated example of data frame:

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                         'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]), 
                                                np.repeat([1, 2, 3], [2, 2, 2])]),
                         'obj': list('AB' * 6),
                         'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3]
df['expected_output'] = [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]

Erfan · Answer 1 · 2019-12-27T16:18:33.667

Updated answer

The problem is that the min column in each group of measurement_id, obj, var should be maintained order. We can check this by group by on measurement_id, obj, var and then checking if the difference in min column is greater than 1. If so, we mark it as a unique duration in expected_output:

df['grouper'] = (df.groupby(['measurement_id', 'obj', 'var'])['min']
                 .apply(lambda x: x.diff().fillna(1).eq(1))
                )

df['expected_output'] = (
    df.groupby(['measurement_id', 'obj', 'var'])['grouper'].transform('sum').astype(int)
)

df = df.drop(columns='grouper')

    measurement_id  min obj  var  expected_output
0                1    1   A    1                1
1                1    1   B    2                2
2                1    2   A    2                1
3                1    2   B    2                2
4                1    3   A    1                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3

Old answer, following OP's logic

We can achieve this by using GroupBy.diff to get your rleid_output, basically a unique identifier each time var changes for each measurement_id& obj

After that using GroupBy.nunique to measure the amount of minutes:

rleid_output = df.groupby(['measurement_id', 'obj'])['var'].diff().abs().bfill()
df['expected_output'] = (df.groupby(['measurement_id', 'obj', rleid_output])['min']
                         .transform('nunique'))

    measurement_id  min obj  var  expected_output
0                1    1   A    1                2
1                1    1   B    2                2
2                1    2   A    1                2
3                1    2   B    2                2
4                1    3   A    2                1
5                1    3   B    1                1
6                2    1   A    2                2
7                2    1   B    1                3
8                2    2   A    2                2
9                2    2   B    1                3
10               2    3   A    1                1
11               2    3   B    1                3

There are two traps in your original solution (marked as old answer now) - an use of `bfill` to fill `NaN` values and `nunique` for counting minutes. The latter would fail if some intermediate values of `min` are lacking and the former would fail for if the value of `var` has changed in second `min`, e.g. `'var': [1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1]` — jakes, Dec 27 '19 at 14:31
Also, your updated answer would fail if one object, let's say `A`, has value of `var` equal to `1` in 1st minute, then equal to `2` in 2nd minute, and equal to `1` again in 3rd minute. — jakes, Dec 27 '19 at 14:35
Can you update your answer with the new dataframe where my answer will not suffice, so I can adjust accordingly. — Erfan, Dec 27 '19 at 14:41
Does your expected output also change with the new example df? — Erfan, Dec 27 '19 at 14:51
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/204952/discussion-between-erfan-and-jakes). — Erfan, Dec 27 '19 at 15:24

score 2 · Answer 2 · answered Dec 27 '19 at 20:38

To mimic the behaviour of R rleid function one can firstly create an artificial column that checks if current value has changed in comparison to previous one or not. In this case, we should do this on grouped var Series:

var_grpd = df.groupby(['measurement_id', 'obj'])['var']
df['tmp'] = (var_grpd.shift(0) != var_grpd.shift(1))

Then, we can use this artificial tmp column to obtain rleid_output2. After that, tmp column is no longer needed.

df['rleid_output2'] = df.groupby('obj')['tmp'].cumsum().astype(int)
df.drop('tmp', axis = 1, inplace = True)

Finally, to check how many minutes var value lasts, we can calculate the difference between last and first minute within a group.

df['expected_output2'] = df.groupby(['obj', 'rleid_output2'])['min'] \
                           .transform(lambda x: x.iat[-1] - x.iat[0] + 1)

.iat is similar to .iloc but allows us to access single value in DataFrame or Series.

R group_by() + rleid() equivalent in Python

2 Answers2

Linked