Let's assume I've got a DataFrame
as below:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [30, 30]),
'min': np.concatenate([np.repeat([1, 2, 3, 4, 5], [6, 6, 6, 6, 6]),
np.repeat([1, 2, 3, 4, 5], [6, 6, 6, 6, 6])]),
'obj': list('ABCDEF' * 10),
'var': [1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2,
1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2,
2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2,
2, 1, 1, 1, 2, 1]})
So there are 2 measurements that measure some var
value for 6 objects in one minute breaks. What I'd like to do is to create a column that would count how long (how many minutes) each var
value wasn't changing. I think it could be achieved by appropriate groupby
, equivalent to R function data.table::rleid
that creates run-length type id column and finally counting how many minutes were within one value of the resulting column.
So the resulting output would be as follows:
df['rleid_output'] = np.concatenate([[1] * 18, [2] * 12, [3] * 6, [4] * 18, [5] * 6])
df['expected_output'] = np.concatenate([[3] * 18, [2] * 12, [1] * 6, [3] * 18, [1] * 6])
Unfortunetely, I don't know how can it can be done in Pandas. I'm struggling both with creating rleid
equivalent as well as obtaining expected_output
then. Any ideas how expected_output
could be achieved, no matter if using rleid
equivalent or not?
@Edit: To answer comments about the logic of 'expected_output', here's how I compute it in my head:
- For 1st minute within 1st measurement we have following pairs of
obj
andvar
: (A, 1), (B, 1), (C, 2), (D, 2), (E, 2), (F, 2) - In 2nd and 3rd minute of measurement 1 those pairs remain unchanged, so the first assignment last for at least 3 minutes
- In 4th minute
var
value changes to 1 forobj = C
. At least one pair has changed, so our whole assignment is treated as changed and starts new value ofrleid_output
. At the same time we already know that previous assignment lasted for exactly 3 minutes and value ofexpected_output
should be set to 3 for all previous rows. Current assignment (obj
,var
) is (A, 1), (B, 1), (C, 1), (D, 2), (E, 2), (F, 2). - In 5th minute assignment didn't changed. Next group of (
measurement_id
,min
) is actually new measurement, so as it was our final minute we know already that we should setexpected_output
to 2 in rows having(measurement_id, min) == (1, 4)
and(measurement_id, min) == (1, 5)
as assignment was the same for 2 minutes. - Next group of rows is new measurement
(measurement_id, min) == (2, 1)
. It starts with assignment (A, 2), (B, 2), (C, 1), (D, 1), (E, 2), (F, 2). It changes in the next minute so we can mark it as the assignment that lasts for only 1 minute (value ofexpected_output
). - Next assignment is (A, 2), (B, 1), (C, 1), (D, 1), (E, 2), (F, 2) and it lasts for 3 minutes - that's why
expected_output = 3
for rows from 36 to 53. - In final, 5th minute of measurement 2, F changes
var
value to 2, so we have new assignment ((A, 2), (B, 1), (C, 1), (D, 1), (E, 2), (F, 1)) that lasts for 1 minute only as we don't have further measurements.
I hope it's clearer now.
Also, it would be good to know how it can be done with individual assignment changes, i.e. if we don't treat assignment as a whole, but rather individual pairs e.g. (A, 2)
@Edit2: Shortened example, as requested, below:
df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]),
np.repeat([1, 2, 3], [2, 2, 2])]),
'obj': list('AB' * 6),
'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4]
df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1]