1

I have a dataset with a reading (Tank Level) every minute from a piece of equipment and want to create a new dataset (dataframe) with a count of the number of samples per day and the number of readings above a set value.

        Noxious Tank Level.MIN  Noxious Tank Level.MAX           Date_Time
0                         9.32                    9.33 2019-12-31 05:01:00
1                         9.32                    9.34 2019-12-31 05:02:00
2                         9.32                    9.35 2019-12-31 05:03:00
3                         9.31                    9.35 2019-12-31 05:04:00
4                         9.31                    9.35 2019-12-31 05:05:00
...                        ...                     ...                 ...
528175                    2.98                    3.01 2020-12-31 23:56:00
528176                    2.98                    3.02 2020-12-31 23:57:00
528177                    2.98                    3.01 2020-12-31 23:58:00
528178                    2.98                    3.02 2020-12-31 23:59:00
528179                    2.98                    2.99 2021-01-01 00:00:00

Using a lamdba function I can see whether each value is an overflow (Tank Level > setpoint) - I have also indexed the dataframe by Date_Time:

df['Overflow'] = df.apply(lambda x: True if x['Noxious Tank Level.MIN'] > 89 else False , axis=1)

                     Noxious Tank Level.MIN  Noxious Tank Level.MAX  Overflow
Date_Time                                                                    
2019-12-31 05:01:00                    9.32                    9.33     False
2019-12-31 05:02:00                    9.32                    9.34     False
2019-12-31 05:03:00                    9.32                    9.35     False
2019-12-31 05:04:00                    9.31                    9.35     False
2019-12-31 05:05:00                    9.31                    9.35     False
...                                     ...                     ...       ...
2020-12-31 23:56:00                    2.98                    3.01     False
2020-12-31 23:57:00                    2.98                    3.02     False
2020-12-31 23:58:00                    2.98                    3.01     False
2020-12-31 23:59:00                    2.98                    3.02     False
2021-01-01 00:00:00                    2.98                    2.99     False

Now I want to count the number of samples per day and the number of 'True' values in the Overflow column to work out what fraction per day is in Overflow

I get the feeling that resample or groupby will be the way to go but I can't figure out how to create a new dataset with just these counts and include the conditional count from the Overflow column

ChemEnger
  • 131
  • 12

1 Answers1

2

First use:

df['Overflow'] = df['Noxious Tank Level.MIN'] > 89

And then for count Trues use sum nad for count values use size per days/ dates:

df1 = df.resample('d')['Overflow'].agg(['sum','size'])

Or:

df1 = df.groupby(pd.Grouper(freq='D'))['Overflow'].agg(['sum','size'])

Or:

df2 = df.groupby(df.index.date)['Overflow'].agg(['sum','size'])
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Thanks @jezrael. As always with python / pandas it's think of the simplest way to do it and that will be the answer. Does exactly what I needed - it was the `.agg` part that I was missing. – ChemEnger Jun 23 '21 at 13:01