3

Need to increase the sequential numbers if certain condition is met or else need to hold the previous number.

Original_dataset:

ID Name Status Cluster Gap
1 A 0 1 15
1 B 1 1 35
1 C 1 1 03
2 B 0 1 26
2 C 0 1 16
3 A 1 1 65
3 C 0 1 89
3 F 0 1 19

Required_Dataset:

ID Name Status Cluster Gap
1 A 0 1 15
1 B 1 2 35
1 C 1 3 03
2 B 0 1 26
2 C 0 1 16
3 A 1 1 65
3 C 0 2 89
3 F 0 2 19

Conditions:

  1. For first occurrence of ID, the cluster should be 1.
  2. If status = 1 or Gap > 28, then cluster needs to increase by 1 based on patient ID (see row 1-C and 2-B --- as ID changes, the Cluster remains 1 as it is the first occurrence of the particular ID).
  3. If the condition is not satisfied, it needs to hold the previous cluster number. (Can refer the final row).

The code which I have tried is:

Original_dataset.loc[((new_df4['gap'] > 28) | (Original_dataset['status'] == 1)),'Cluster'] = Original_dataset['Cluster'] + 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252

2 Answers2

2

First set 1 if greater GAP like 28 or if Status is 1 or first duplicated value of ID in DataFrame.loc, then use lambda function with GroupBy.cumsum and GroupBy.ffill

m = df['Gap'].gt(28)
m1 = df['Status'].eq(1)
m2 = ~df['ID'].duplicated()

df.loc[m | m1 | m2, 'Cluster1'] = 1

f = lambda x: x.cumsum().ffill(downcast='int')
df['Cluster1'] = df.groupby('ID')['Cluster1'].transform(f)
print (df)
   ID Name  Status  Cluster  Gap  Cluster1
0   1    A       0        1   15         1
1   1    B       1        1   35         2
2   1    C       1        1    3         3
3   2    B       0        1   26         1
4   2    C       0        1   16         1
5   3    A       1        1   65         1
6   3    C       0        1   89         2
7   3    F       0        1   19         2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

You can use a mask based on 3 conditions, and a simple groupby.cumsum:

# is Gap≄28 OR Status==1 OR first ID ?
m = df['Gap'].ge(28) | df['Status'].eq(1) | ~df['ID'].duplicated()

# then increment count per group
df['Cluster'] = m.groupby(df['ID']).cumsum()

Output:

   ID Name  Status  Cluster  Gap
0   1    A       0        1   15
1   1    B       1        2   35
2   1    C       1        3    3
3   2    B       0        1   26
4   2    C       0        1   16
5   3    A       1        1   65
6   3    C       0        2   89
7   3    F       0        2   19
mozway
  • 194,879
  • 13
  • 39
  • 75