0

I have the following Pandas datetime-indexed dataframe:

date_time category num_files num_lines worst_index
2022-07-15 23:50:00 black 2 868 0.01
2022-07-15 23:50:00 red 5 5631 0.01
2022-07-15 23:50:00 green 1 1891 0.00
2022-07-15 23:50:00 all 8 8390 0.01
2022-07-16 00:00:00 all 0 0 0.00
2022-07-16 00:10:00 all 0 0 0.00
2022-07-16 00:20:00 black 1 656 0.00
2022-07-16 00:20:00 red 2 4922 0.00
2022-07-16 00:20:00 green 1 1847 0.00
2022-07-16 00:20:00 all 4 7425 0.00
2022-07-16 00:30:00 all 0 0 0.00

The data is collected every 10 minutes for the categories "black", "red" and "green" + there is a summary category "all" with respectively cumulated values for "num_files", "num_lines" and "worst_index".

In case, that num_files, num_lines or worst_index for the "all" category of a measurement point is 0 (zero), I would like to set those values for the three categories "black", "red" and "green" also to 0 (zero) in the dataframe. So, either insert a corresponding row if there is none for that timestamp so far.

Background is that I found the subsequently generated matplotlib graphs indicating wrongly for the three categories: e.g. for category "black" there should not be a direct line between timestamp "2022-07-15 23:50:00" "num_files"-value 2 and "num_files"-value 1 at timestamp "2022-07-16 00:20:00" as actually "num_files" for category black was 0 (zero) for the timestamps "2022-07-16 00:00:00" and "2022-07-16 00:10:00" in between but unfortunately the data is collected like this which I cannot change.

I tried to iterate through the datetime indexed dataframe using iterrows and to select / filter with loc but did not manage it with my too junior Python and Pandas knowledge and experience.

Mr.Rulez
  • 25
  • 1
  • 5

1 Answers1

1

You can do this with a reindexing operation, treating date_time and category as a multi-index. First, construct the final desired index (i.e., 10 minute separated dates with an entry for every category). The MultiIndex.from_product method does this neatly:

drange = pd.date_range(df['date_time'].min(), df['date_time'].max(), freq='10T')
cats = ['black', 'green', 'red', 'all']
new_idx = pd.MultiIndex.from_product([drange, cats], names=['date_time', 'category'])

Then, reindex your data with the new_idx (after temporarily turning the date/category columns to the index). Fill any NAs created with 0:

df = df.set_index(['date_time', 'category']).reindex(new_idx).reset_index().fillna(0)

Result:

             date_time category  num_files  num_lines  worst_index
0  2022-07-15 23:50:00    black        2.0      868.0         0.01
1  2022-07-15 23:50:00    green        1.0     1891.0         0.00
2  2022-07-15 23:50:00      red        5.0     5631.0         0.01
3  2022-07-15 23:50:00      all        8.0     8390.0         0.01
4  2022-07-16 00:00:00    black        0.0        0.0         0.00
5  2022-07-16 00:00:00    green        0.0        0.0         0.00
6  2022-07-16 00:00:00      red        0.0        0.0         0.00
7  2022-07-16 00:00:00      all        0.0        0.0         0.00
8  2022-07-16 00:10:00    black        0.0        0.0         0.00
9  2022-07-16 00:10:00    green        0.0        0.0         0.00
10 2022-07-16 00:10:00      red        0.0        0.0         0.00
11 2022-07-16 00:10:00      all        0.0        0.0         0.00
12 2022-07-16 00:20:00    black        1.0      656.0         0.00
13 2022-07-16 00:20:00    green        1.0     1847.0         0.00
14 2022-07-16 00:20:00      red        2.0     4922.0         0.00
15 2022-07-16 00:20:00      all        4.0     7425.0         0.00
16 2022-07-16 00:30:00    black        0.0        0.0         0.00
17 2022-07-16 00:30:00    green        0.0        0.0         0.00
18 2022-07-16 00:30:00      red        0.0        0.0         0.00
19 2022-07-16 00:30:00      all        0.0        0.0         0.00
Tom
  • 8,310
  • 2
  • 16
  • 36
  • Many thanks for your great solution! A small additional question :) So far I graphed based on a datetimeindex per category: `ax.plot(df.index.values[df["category"]=="all"], df.num_files[df["category"]=="all"], linewidth=2.5, color='blue', label = "all") ax.plot(df.index.values[df["category"]=="green"], df.num_files[df["category"]=="84"], linewidth=1.5, color='green', label = "green")` in order to create a plot with timestamps on x-axis and num_values on y-axis with 4 line-graphs for "black", "green", "red" and "all". How could I do it using a MultiIndex for date_time and category? – Mr.Rulez Jul 17 '22 at 09:58
  • @Mr.Rulez I'm not totally sure I follow your question, but I would say putting things back into a MultiIndex will only make the plotting more convoluted, I reckon. The final data I show above doesn't have a MultiIndex (just an integer index). You could make `date_time` be the index again, which may be what you want based on the plotting code you shared. The code you have for plotting looks reasonable - if you are having a particular difficulty it might be worth creating a new post? – Tom Jul 18 '22 at 03:24