The title might be a bit confusing, but in essence: we have a dataframe, say, like this:
| | SKUCODE | PROCCODE | RUNCODE | RSNCODE | BEGTIME | ENDTIME | RSNNAME |
|---|---|---|---|---|---|---|---|
| 0 | 218032 | A | 21183528 | 1010 | 2020-11-19 04:00 | 2020-11-19 04:15 | Bank holidays |
| 1 | 218032 | A | 21183959 | 300.318 | 2021-01-27 10:50 | 2021-01-27 10:53 | Base paper roll change |
| 2 | 218032 | A | 21183959 | 300.302 | 2021-01-27 11:50 | 2021-01-27 11:53 | Product/grade change |
| 3 | 218032 | A | 21183959 | 300.302 | 2021-01-27 12:02 | 2021-01-27 13:12 | Product/grade change |
| 4 | 200402 | A | 21184021 | 300.302 | 2021-01-27 13:16 | 2021-01-27 13:19 | Product/grade change |
Each row is a break event happening on a production line. As can be seen, some single break events (sharing a common RSNNAME) are spread over multiple consecutive rows (for one data-gathering reason or another), and we would like to compress each of these into just one row, for example compressing rows 2 through 4 of the example dataframe into a single row, resulting in something like this:
| | SKUCODE | PROCCODE | RUNCODE | RSNCODE | BEGTIME | ENDTIME | RSNNAME |
|---|---|---|---|---|---|---|---|
| 0 | 218032 | A | 21183528 | 1010 | 2020-11-19 04:00 | 2020-11-19 04:15 | Bank holidays |
| 1 | 218032 | A | 21183959 | 300.318 | 2021-01-27 10:50 | 2021-01-27 10:53 | Base paper roll change |
| 2 | 218032 | A | 21183959 | 300.302 | 2021-01-27 11:50 | 2021-01-27 13:19 | Product/grade change |
The resulting single row keeps the BEGTIME (signifying the start of the break) of the first combined row and the ENDTIME (signifying the end of the break) of the last combined row, so that the correct timestamps for the entire break event are captured.

To make the problem harder still, we might also want a time threshold for combining rows. Say, if more than 15 minutes pass between the ENDTIME of one row and the BEGTIME of the next, we would treat them as separate break events even though they share the same RSNNAME.
This is accomplished quite easily with iterrows by comparing each row to the next, checking whether they share a duplicate value in the RSNNAME column, and, if so, copying the ENDTIME of the latter onto the former. The latter row can then be dropped as redundant. Here we can also introduce the logic that decides whether a seemingly single break event is actually two separate events of the same nature that merely happened some time apart.
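A rough sketch of that row-by-row logic might look like the function below (the function name, the 15-minute threshold default, and the assumption that BEGTIME/ENDTIME are already parsed as datetimes and the frame is sorted by BEGTIME are mine):

```python
import pandas as pd

def compress_breaks_loop(df, threshold=pd.Timedelta(minutes=15)):
    """Merge consecutive rows that share RSNNAME, row by row."""
    df = df.reset_index(drop=True).copy()
    drop_idx = []
    keep = 0  # index of the row we are currently merging into
    for i in range(1, len(df)):
        same_reason = df.at[i, "RSNNAME"] == df.at[keep, "RSNNAME"]
        small_gap = (df.at[i, "BEGTIME"] - df.at[keep, "ENDTIME"]) <= threshold
        if same_reason and small_gap:
            # Extend the kept row's ENDTIME and mark the current row for removal.
            df.at[keep, "ENDTIME"] = df.at[i, "ENDTIME"]
            drop_idx.append(i)
        else:
            keep = i
    return df.drop(index=drop_idx).reset_index(drop=True)
```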
However, using iterrows for this purpose gets quite slow. Is there a way to solve this with vectorized functions or other more efficient means? I've played around with shifting rows and comparing them to each other. Shifting and comparing two adjacent rows is quite simple and makes it easy to grab the ENDTIME of the latter row when a duplicate is detected, but this runs into issues when there are n consecutive duplicate causes.
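For reference, the pairwise shift-and-compare I mean looks roughly like this sketch (same assumptions as above). It handles isolated pairs, but as described, the ENDTIME only propagates one step for longer runs:

```python
import pandas as pd

def compress_pairs_only(df, threshold=pd.Timedelta(minutes=15)):
    """Pairwise shift-and-compare; works for pairs but not for longer runs."""
    nxt = df.shift(-1)
    same_as_next = df["RSNNAME"].eq(nxt["RSNNAME"])
    small_gap = (nxt["BEGTIME"] - df["ENDTIME"]) <= threshold
    dup_with_next = same_as_next & small_gap

    out = df.copy()
    # Pull the next row's ENDTIME onto flagged rows...
    out.loc[dup_with_next, "ENDTIME"] = nxt.loc[dup_with_next, "ENDTIME"]
    # ...and drop rows that were merged into the row above them.
    out = out[~dup_with_next.shift(1, fill_value=False)]
    # With three or more consecutive duplicates the ENDTIME is only carried
    # one step, so the first row of a run never receives the last row's ENDTIME.
    return out.reset_index(drop=True)
```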
Another idea would be to create a boolean mask that checks whether the row below the current one is a duplicate. With multiple consecutive duplicate rows this produces a run of consecutive "True" labels, and the last "True" before a "False" marks the row whose ENDTIME should be grabbed for the first "True" of that run. I have yet to find a way to implement this in practice using vectorization, however.
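For what it's worth, one direction that mask idea could perhaps be taken (an untested sketch, under the same assumptions as above) is to flip it into a "new event starts here" flag, turn that flag into group labels with a cumulative sum, and aggregate each group with the first BEGTIME and the last ENDTIME:

```python
import pandas as pd

def compress_breaks_vectorized(df, threshold=pd.Timedelta(minutes=15)):
    """Group consecutive rows that share RSNNAME and are close enough in time."""
    df = df.sort_values("BEGTIME").reset_index(drop=True)
    prev = df.shift(1)
    # A row starts a new break event if its RSNNAME differs from the previous
    # row's, or if the gap since the previous row's ENDTIME exceeds the threshold.
    new_event = (
        df["RSNNAME"].ne(prev["RSNNAME"])
        | ((df["BEGTIME"] - prev["ENDTIME"]) > threshold)
    )
    group_id = new_event.cumsum()
    # Keep the first value of every column, except ENDTIME, which takes the last.
    agg = {col: "first" for col in df.columns}
    agg["ENDTIME"] = "last"
    return df.groupby(group_id).agg(agg).reset_index(drop=True)
```

Applied to the example dataframe above, this should collapse rows 2 through 4 into a single row with BEGTIME 2021-01-27 11:50 and ENDTIME 2021-01-27 13:19, matching the desired output.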