1
   x_TimeGenerated  batch_number
      22/03/2023    BATCH_1
      21/03/2023    BATCH_1
      20/03/2023    BATCH_1
      19/03/2023    BATCH_1
      18/03/2023    BATCH_1
      18/03/2023    BATCH_2
      17/03/2023    BATCH_2
      16/03/2023    BATCH_2
      15/03/2023    BATCH_2
      14/03/2023    BATCH_2
      14/03/2023    BATCH_3
      13/03/2023    BATCH_3
      12/03/2023    BATCH_3
      11/03/2023    BATCH_3
      10/03/2023    BATCH_3

I am processing data in batches & there are duplicate dates in each of the consecutive batches. For example both batch_1 & batch_2 has 18/03/2023 & batch_2 & batch_3 both has 14/03/2023.There are multiple records associated with each date so 14/03/2023 for instance has 250 records & so on. I want to remove all records of 18/03/2023 from batch_2 & 14/03/2023 from batch_3 to get rid of duplicate records. How can this be done ?

Lopa
  • 51
  • 6

1 Answers1

1

I can suggest the following. Create a 'col_new' column for grouping (where numbers start from 1). We group by it using a slice by 'split' (inclusive up to the penultimate group). In 'ind' we get the last row index of the next group. In 'aaa', we take through the data through the slice and get duplicates by setting these lines to True for the 'dr' column. Then use the 'dr' column with a tilde ~ (take rows that are not true).

import pandas as pd

df['col_new'] = (df['batch_number'] != df['batch_number'].shift()).astype(int).cumsum()
df['dr'] = False
split = df[df['col_new'] == df.loc[len(df) - 1, 'col_new']].index[0]


def my_func(x):
    ind = df[df['col_new'] == x['col_new'].values[0] + 1].index[-1]
    aaa = df[x.index[0]:ind]
    aaa = aaa[aaa.loc[x.index[0]:ind, 'x_TimeGenerated'].duplicated()].index
    df.loc[aaa, 'dr'] = True


df[:split].groupby('col_new').apply(my_func)

df = df[~df['dr']].reset_index(drop=True)#If the columns 'col_new', 'dr' are not needed, then comment out this line and uncomment the next one.
#df = df[~df['dr']].reset_index(drop=True).drop(columns=['col_new', 'dr'])

print(df)

Output

   x_TimeGenerated batch_number  col_new     dr
0       22/03/2023      BATCH_1        1  False
1       21/03/2023      BATCH_1        1  False
2       20/03/2023      BATCH_1        1  False
3       19/03/2023      BATCH_1        1  False
4       18/03/2023      BATCH_1        1  False
5       17/03/2023      BATCH_2        2  False
6       16/03/2023      BATCH_2        2  False
7       15/03/2023      BATCH_2        2  False
8       14/03/2023      BATCH_2        2  False
9       13/03/2023      BATCH_3        3  False
10      12/03/2023      BATCH_3        3  False
11      11/03/2023      BATCH_3        3  False
12      10/03/2023      BATCH_3        3  False
inquirer
  • 4,286
  • 2
  • 9
  • 16