
I tried looking up something similar but couldn't find it. I have a dataframe with the structure below, and I'm looking to delete rows that have the same score for 5 days or more. So in the case below, PersonID AB-123's records from 2/1 to 2/6 need to be deleted, and likewise DG-3465's from 2/3 to 2/10, but nothing for TY-9456. I was thinking of using rolling(), but that would delete only 2/1-2/5 for AB-123 and not 2/6.

PersonID    Date    Score
AB-123  2/1/2016    0
AB-123  2/2/2016    0
AB-123  2/3/2016    0
AB-123  2/4/2016    0
AB-123  2/5/2016    0
AB-123  2/6/2016    0
AB-123  2/7/2016    67.5
AB-123  2/8/2016    73.4
AB-123  2/9/2016    70.5
AB-123  2/10/2016   68
DG-3465 2/1/2016    22.5
DG-3465 2/2/2016    25.6
DG-3465 2/3/2016    36.4
DG-3465 2/4/2016    36.4
DG-3465 2/5/2016    36.4
DG-3465 2/6/2016    36.4
DG-3465 2/7/2016    36.4
DG-3465 2/8/2016    36.4
DG-3465 2/9/2016    36.4
DG-3465 2/10/2016   36.4
TY-9456 2/1/2016    0
TY-9456 2/2/2016    0
TY-9456 2/3/2016    5.23
TY-9456 2/4/2016    4.12
TY-9456 2/5/2016    5.95
TY-9456 2/6/2016    6.97
TY-9456 2/7/2016    12.45
TY-9456 2/8/2016    15.61
TY-9456 2/9/2016    15.61
TY-9456 2/10/2016   15.61

I've tried a few different things, but I'm kind of stuck with nothing popping up in my head. What would you suggest? Using Python pandas, by the way ;)
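For reproducibility, here is a minimal sketch that builds the sample dataframe above (parsing Date to datetime is my assumption; it could equally stay as strings):

import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    'PersonID': ['AB-123'] * 10 + ['DG-3465'] * 10 + ['TY-9456'] * 10,
    'Date': pd.to_datetime(['2/%d/2016' % d for d in range(1, 11)] * 3),
    'Score': [0, 0, 0, 0, 0, 0, 67.5, 73.4, 70.5, 68,
              22.5, 25.6, 36.4, 36.4, 36.4, 36.4, 36.4, 36.4, 36.4, 36.4,
              0, 0, 5.23, 4.12, 5.95, 6.97, 12.45, 15.61, 15.61, 15.61],
})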

PyRaider

3 Answers


You can roll over the Score column, compute the running standard deviation, and then drop the rows where the standard deviation is zero, along with the four rows before each of them (this assumes you want to delete rows with the same score on consecutive days):

import numpy as np

df.drop(np.unique(df.Score.rolling(5).std()[lambda x: x == 0].index.values - np.arange(5)[:, None]))
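Spelled out step by step, the same idea looks like this (the helper names std5, ends, and to_drop are mine, and a default RangeIndex is assumed):

import numpy as np

# Rolling std over 5 rows is 0 exactly when those 5 scores are equal.
std5 = df.Score.rolling(5).std()

# Index labels where a window of 5 identical scores *ends*.
ends = std5[std5 == 0].index.values

# Expand each end label to the whole window: the row itself plus
# the four rows before it, then de-duplicate overlapping windows.
to_drop = np.unique(ends - np.arange(5)[:, None])

result = df.drop(to_drop)

Note the rolling window can straddle two PersonIDs; that only matters if five identical scores happen to span a person boundary.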


Psidom
  • Both @A-Za-z's and Psidom's solutions worked correctly, but Psidom's was faster: with timeit it was 17.1236965103 vs 49.4611104532 for 10000 runs. Hence I selected this answer. – PyRaider May 16 '17 at 21:25
  • @Psidom Now this is a really cool solution avoiding expensive groupbys. – Scott Boston May 17 '17 at 13:22

You can groupby PersonID together with a consecutive-run id built from shift() and cumsum(), then filter out the runs of length 5 or more. Edited to include @Scott Boston's suggestion:

df.groupby(['PersonID',(df.Score != df.Score.shift()).cumsum()]).filter(lambda x: x.Score.size < 5)


    PersonID    Date    Score
6   AB-123  2/7/2016    67.50
7   AB-123  2/8/2016    73.40
8   AB-123  2/9/2016    70.50
9   AB-123  2/10/2016   68.00
10  DG-3465 2/1/2016    22.50
11  DG-3465 2/2/2016    25.60
20  TY-9456 2/1/2016    0.00
21  TY-9456 2/2/2016    0.00
22  TY-9456 2/3/2016    5.23
23  TY-9456 2/4/2016    4.12
24  TY-9456 2/5/2016    5.95
25  TY-9456 2/6/2016    6.97
26  TY-9456 2/7/2016    12.45
27  TY-9456 2/8/2016    15.61
28  TY-9456 2/9/2016    15.61
29  TY-9456 2/10/2016   15.61
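To see why this works, here is a small sketch that prints the run ids the shift/cumsum trick produces (the name run_id is mine, not part of the answer):

# Each time the score changes from the previous row, cumsum() increments,
# so rows in the same consecutive-score run share one id.
run_id = (df.Score != df.Score.shift()).cumsum()

# Pairing it with PersonID stops a run from leaking across two people
# who happen to end and start with the same score.
print(df.assign(run_id=run_id).groupby(['PersonID', 'run_id']).size())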
Vaishali
  • I think you need to add PersonID to your groupby, just in case one PersonID ends with 3 scores of 0 and the next starts with 2 scores of 0. Your current statement would filter those records out as well: df.groupby(['PersonID',(df.Score != df.Score.shift()).cumsum()]).filter(lambda x: x.Score.size < 5) – Scott Boston May 16 '17 at 19:39
  • @Scott Boston, you are right, didn't think of that. Thanks! – Vaishali May 16 '17 at 19:41

You can exclude the rows whose score differs by 0 from the previous row and whose date is exactly one day after it (note this keeps the first row of every run and drops consecutive duplicates regardless of run length):

In [11]: df[(df.Score.diff() != 0) | (df.Date.diff() != pd.offsets.Day().delta)]
Out[11]:
   PersonID       Date  Score
0    AB-123 2016-02-01   0.00
6    AB-123 2016-02-07  67.50
7    AB-123 2016-02-08  73.40
8    AB-123 2016-02-09  70.50
9    AB-123 2016-02-10  68.00
10  DG-3465 2016-02-01  22.50
11  DG-3465 2016-02-02  25.60
12  DG-3465 2016-02-03  36.40
20  TY-9456 2016-02-01   0.00
22  TY-9456 2016-02-03   5.23
23  TY-9456 2016-02-04   4.12
24  TY-9456 2016-02-05   5.95
25  TY-9456 2016-02-06   6.97
26  TY-9456 2016-02-07  12.45
27  TY-9456 2016-02-08  15.61
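Note this assumes Date is a real datetime column, as the Out[11] display suggests. A variant if it is still stored as strings (pd.Timedelta('1D') is an equivalent spelling of pd.offsets.Day().delta):

# Parse the dates first, then apply the same mask.
df['Date'] = pd.to_datetime(df['Date'])
df[(df.Score.diff() != 0) | (df.Date.diff() != pd.Timedelta('1D'))]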
Andy Hayden