2

Let's assume I've got a DataFrame as below:

import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [30, 30]),
                             'min': np.concatenate([np.repeat([1, 2, 3, 4, 5], [6, 6, 6, 6, 6]), 
                                                    np.repeat([1, 2, 3, 4, 5], [6, 6, 6, 6, 6])]),
                             'obj': list('ABCDEF' * 10),
                             'var': [1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2,
                                     1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2,                                                           
                                     2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2,
                                     2, 1, 1, 1, 2, 1]})

So there are 2 measurements that measure some var value for 6 objects in one minute breaks. What I'd like to do is to create a column that would count how long (how many minutes) each var value wasn't changing. I think it could be achieved by appropriate groupby, equivalent to R function data.table::rleid that creates run-length type id column and finally counting how many minutes were within one value of the resulting column.

So the resulting output would be as follows:

df['rleid_output'] = np.concatenate([[1] * 18, [2] * 12, [3] * 6, [4] * 18, [5] * 6])
df['expected_output'] = np.concatenate([[3] * 18, [2] * 12, [1] * 6, [3] * 18, [1] * 6])

Unfortunetely, I don't know how can it can be done in Pandas. I'm struggling both with creating rleid equivalent as well as obtaining expected_output then. Any ideas how expected_output could be achieved, no matter if using rleid equivalent or not?

@Edit: To answer comments about the logic of 'expected_output', here's how I compute it in my head:

  • For 1st minute within 1st measurement we have following pairs of obj and var: (A, 1), (B, 1), (C, 2), (D, 2), (E, 2), (F, 2)
  • In 2nd and 3rd minute of measurement 1 those pairs remain unchanged, so the first assignment last for at least 3 minutes
  • In 4th minute var value changes to 1 for obj = C. At least one pair has changed, so our whole assignment is treated as changed and starts new value of rleid_output. At the same time we already know that previous assignment lasted for exactly 3 minutes and value of expected_output should be set to 3 for all previous rows. Current assignment (obj, var) is (A, 1), (B, 1), (C, 1), (D, 2), (E, 2), (F, 2).
  • In 5th minute assignment didn't changed. Next group of (measurement_id, min) is actually new measurement, so as it was our final minute we know already that we should set expected_output to 2 in rows having (measurement_id, min) == (1, 4) and (measurement_id, min) == (1, 5) as assignment was the same for 2 minutes.
  • Next group of rows is new measurement (measurement_id, min) == (2, 1). It starts with assignment (A, 2), (B, 2), (C, 1), (D, 1), (E, 2), (F, 2). It changes in the next minute so we can mark it as the assignment that lasts for only 1 minute (value of expected_output).
  • Next assignment is (A, 2), (B, 1), (C, 1), (D, 1), (E, 2), (F, 2) and it lasts for 3 minutes - that's why expected_output = 3 for rows from 36 to 53.
  • In final, 5th minute of measurement 2, F changes var value to 2, so we have new assignment ((A, 2), (B, 1), (C, 1), (D, 1), (E, 2), (F, 1)) that lasts for 1 minute only as we don't have further measurements.

I hope it's clearer now.

Also, it would be good to know how it can be done with individual assignment changes, i.e. if we don't treat assignment as a whole, but rather individual pairs e.g. (A, 2)

@Edit2: Shortened example, as requested, below:

df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
                         'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]), 
                                                np.repeat([1, 2, 3], [2, 2, 2])]),
                         'obj': list('AB' * 6),
                         'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4]
df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1]
jakes
  • 1,964
  • 3
  • 18
  • 50
  • `expected_output` just counts how many minutes values of `var` remains unchanged. So you can obtain it as follows: if `var` value for any `obj` has changed in comparison with the previous minute, you get new value of `rleid_output`. Then `expected_output` just counts unique values of `min` within the same value of `rleid_output`. – jakes Dec 26 '19 at 11:30
  • I'm still not clear what `rleid_output` and `expected_output` is supposed to be? Can you may be add a simple example (+ walkthrough) of how `rleid_output` and `expected_output` are computed? – thushv89 Dec 26 '19 at 12:58
  • 1
    @thushv89, I have edited the post to incorporate detailed workthrough. – jakes Dec 26 '19 at 14:48
  • 1
    @anky_91, as requested, I have put the example shortened to 12 rows (I think it's minimal to incorporate all quirks) at the bottom – jakes Dec 26 '19 at 18:35
  • 1
    @anky_91, those question are different (please note the difference in `rleid_output` and `expected_output`). Here, the new run id should be assigned to all `obj` within the same group in case of `measurement_id` or `var` has changed for any `obj`. Could you please reopen? – jakes Dec 27 '19 at 14:10
  • @jakes i see, repoened – anky Dec 27 '19 at 14:10

0 Answers0