I have a dataframe containing logs, like this example:
+------------+--------------------------+--------------------+-------------------+
|      Source|                     Error|          @timestamp|  timestamp_rounded|
+------------+--------------------------+--------------------+-------------------+
|           A|                        No|2021-09-12T14:07:...|2021-09-12 16:10:00|
|           B|                        No|2021-09-12T12:49:...|2021-09-12 14:50:00|
|           C|                        No|2021-09-12T12:59:...|2021-09-12 15:00:00|
|           C|                        No|2021-09-12T12:58:...|2021-09-12 15:00:00|
|           B|                        No|2021-09-12T14:22:...|2021-09-12 16:20:00|
|           A|                       Yes|2021-09-12T14:22:...|2021-09-12 16:25:00|
|           B|                        No|2021-09-12T13:00:...|2021-09-12 15:00:00|
|           B|                        No|2021-09-12T12:57:...|2021-09-12 14:55:00|
|           B|                        No|2021-09-12T12:57:...|2021-09-12 15:00:00|
|           B|                        No|2021-09-12T12:58:...|2021-09-12 15:00:00|
|           C|                        No|2021-09-12T12:54:...|2021-09-12 14:55:00|
|           A|                       Yes|2021-09-12T14:17:...|2021-09-12 16:15:00|
|           B|                        No|2021-09-12T12:43:...|2021-09-12 14:45:00|
|           A|                        No|2021-09-12T12:45:...|2021-09-12 14:45:00|
|           D|                        No|2021-09-12T12:57:...|2021-09-12 14:55:00|
|           A|                        No|2021-09-12T13:00:...|2021-09-12 15:00:00|
|           C|                        No|2021-09-12T12:47:...|2021-09-12 14:45:00|
|           A|                        No|2021-09-12T12:57:...|2021-09-12 15:00:00|
|           A|                        No|2021-09-12T13:00:...|2021-09-12 15:00:00|
|           A|                        No|2021-09-12T14:23:...|2021-09-12 16:25:00|
+------------+--------------------------+--------------------+-------------------+
only showing top 20 rows
My dataframe has millions of logs, not that it matters.
I would like to calculate the error rate of every source for every 5-minute interval. I have searched for documentation on transformations like this (groupby with partition? double groupby?) but I haven't found much information.
I can create a new column that maps Yes to 1 and No to 0, then take the mean for every source with groupby and {avg: foo} to get the error rate per source, but I want it for every 5 minutes (see the 'timestamp_rounded' column).
The result would look like this:
+-------------------+------------+--------------+-------------+------------+
|  timestamp_rounded|Error_rate_A|  Error_rate_B| Error_rate_C|Error_rate_D|
+-------------------+------------+--------------+-------------+------------+
|2021-09-12 16:10:00|           0|           0.2|            0|         0.2|
|2021-09-12 16:15:00|         0.1|           0.3|            0|           0|
|2021-09-12 16:20:00|           0|           0.2|            0|           0|
|2021-09-12 16:25:00|           0|           0.2|            0|           0|
|2021-09-12 16:30:00|           0|           0.2|            0|           0|
|2021-09-12 16:35:00|         0.2|           0.2|            0|           0|
|2021-09-12 16:40:00|         0.3|           0.2|            0|         0.2|
|2021-09-12 16:45:00|         0.4|           0.3|            0|           0|
etc...
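To make the computation I am after concrete, here is a minimal plain-Python sketch (not Spark) run on a few rows copied from the sample above. The function name error_rates and the choice to return None for a source with no logs in a given interval are my own assumptions for illustration; the real solution needs to do the same thing on the dataframe.

```python
from collections import defaultdict

# A few (Source, Error, timestamp_rounded) rows copied from the sample above.
rows = [
    ("A", "No",  "2021-09-12 16:25:00"),
    ("A", "Yes", "2021-09-12 16:25:00"),
    ("A", "Yes", "2021-09-12 16:15:00"),
    ("B", "No",  "2021-09-12 16:20:00"),
    ("B", "No",  "2021-09-12 15:00:00"),
    ("C", "No",  "2021-09-12 15:00:00"),
]


def error_rates(rows):
    """Return {timestamp_rounded: {source: error_rate}} in wide form.

    Assumption: a source with no logs in an interval gets None, not 0.
    """
    # (interval, source) -> [number of errors, number of logs]
    counts = defaultdict(lambda: [0, 0])
    for source, error, interval in rows:
        counts[(interval, source)][0] += 1 if error == "Yes" else 0  # Yes -> 1, No -> 0
        counts[(interval, source)][1] += 1

    sources = sorted({source for source, _, _ in rows})
    intervals = sorted({interval for _, _, interval in rows})

    table = {}
    for interval in intervals:
        table[interval] = {}
        for source in sources:
            errors, total = counts.get((interval, source), (0, 0))
            # Mean of the 0/1 flag = errors / total logs for that cell.
            table[interval][source] = errors / total if total else None
    return table


result = error_rates(rows)
for interval, rates in result.items():
    print(interval, rates)
```

In Spark terms I imagine this is a groupBy on timestamp_rounded combined with some kind of pivot on Source, but I am not sure that is the right approach.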
Sources can be very numerous (my example has 4, but there can be thousands of them).
Please tell me if you need more information. Thanks a lot!