-2

I have this df:

df = pd.DataFrame({"on": [1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]}, 
              index=pd.date_range(start = "2020-04-09 6:45", periods = 30, freq = '8H'))

and want to create a weekly profile for the column df['on'].

I can insert the weekdays and times as follows:

df['day_name'] = df.index.day_name() 
df['time'] = df.index.time

and geht this df:

                    on  day_name    time
2020-04-09 06:45:00 1   Thursday    06:45:00
2020-04-09 14:45:00 1   Thursday    14:45:00
2020-04-09 22:45:00 1   Thursday    22:45:00
2020-04-10 06:45:00 1   Friday      06:45:00
2020-04-10 14:45:00 1   Friday      14:45:00
2020-04-10 22:45:00 0   Friday      22:45:00
2020-04-11 06:45:00 0   Saturday    06:45:00
2020-04-11 14:45:00 0   Saturday    14:45:00
2020-04-11 22:45:00 1   Saturday    22:45:00
2020-04-12 06:45:00 0   Sunday      06:45:00
2020-04-12 14:45:00 0   Sunday      14:45:00
2020-04-12 22:45:00 1   Sunday      22:45:00
2020-04-13 06:45:00 1   Monday      06:45:00
2020-04-13 14:45:00 0   Monday      14:45:00
2020-04-13 22:45:00 0   Monday      22:45:00
2020-04-14 06:45:00 0   Tuesday     06:45:00
2020-04-14 14:45:00 0   Tuesday     14:45:00
2020-04-14 22:45:00 1   Tuesday     22:45:00
2020-04-15 06:45:00 0   Wednesday   06:45:00
2020-04-15 14:45:00 1   Wednesday   14:45:00
2020-04-15 22:45:00 1   Wednesday   22:45:00
2020-04-16 06:45:00 0   Thursday    06:45:00
2020-04-16 14:45:00 0   Thursday    14:45:00
2020-04-16 22:45:00 0   Thursday    22:45:00
2020-04-17 06:45:00 1   Friday      06:45:00
2020-04-17 14:45:00 1   Friday      14:45:00
2020-04-17 22:45:00 1   Friday      22:45:00
2020-04-18 06:45:00 0   Saturday    06:45:00
2020-04-18 14:45:00 0   Saturday    14:45:00
2020-04-18 22:45:00 0   Saturday    22:45:00

can someone help me how to get the probability that for a certain time (e.g. tuesday, 22:45) the column df['on'] == 1? And this is best done as a course for the whole week... (In this example the probability for thursday, 22:45 is: 1/2)

Thanks alot :)

Vini
  • 125
  • 1
  • 8

2 Answers2

2

I considered two options in your question:

1. Ratio per week day: I calculated the ratio (probability) that the column 'on'==1 for a given day:

Ratio per weekday:

df_2=pd.DataFrame()
df_2['Ones']=df[df['on']==1]['day_name'].value_counts()
df_2['All']=df['day_name'].value_counts()
df_2['Ratio']=df_2['Ones']/df_2['All']
df_2

Here's the output:

        Ones    All Ratio
Friday     5    6   0.833333
Thursday   3    6   0.500000
Wednesday  2    3   0.666667
Monday     1    3   0.333333
Saturday   1    6   0.166667
Sunday     1    3   0.333333
Tuesday    1    3   0.333333

2. Ratio per day per time: Here i calculated the probability that on day "x" at time "y", the column "on" would be 1:

Ratio per week day and time:

df_3 = df.groupby(['day_name', 'time']).agg({'on': 'count'})
df_3['ones'] = df.groupby(['day_name', 'time']).agg({'on': 'sum'})
df_3['Ratio'] = df_3['ones']/df_3['on']
df_3

Here's the output:

                  on    ones    Ratio
day_name    time            
Friday  06:45:00    2   2   1.0
        14:45:00    2   2   1.0
        22:45:00    2   1   0.5
Monday  06:45:00    1   1   1.0
        14:45:00    1   0   0.0
        22:45:00    1   0   0.0
Saturday06:45:00    2   0   0.0
        14:45:00    2   0   0.0
        22:45:00    2   1   0.5
Sunday  06:45:00    1   0   0.0
        14:45:00    1   0   0.0
        22:45:00    1   1   1.0
Thursday06:45:00    2   1   0.5
        14:45:00    2   1   0.5
        22:45:00    2   1   0.5
Tuesday 06:45:00    1   0   0.0
        14:45:00    1   0   0.0
        22:45:00    1   1   1.0
Wednesday06:45:00   1   0   0.0
        14:45:00    1   1   1.0
        22:45:00    1   1   1.0

Probabily Plotting

To answer your request, I had to do some modifications of the code above: I needed to sort the indexes as requested, merged them into one index then converted the index into strings to avoid some problems in the plotting. Here's the new code:

#Ratio per week day and time

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_3 = df.groupby(['day_name', 'time']).agg({'on': 'count'})
df_3['ones'] = df.groupby(['day_name', 'time']).agg({'on': 'sum'})
df_3['Ratio'] = df_3['ones']/df_3['on']
df_3 = df_3.reindex(days, level=0)
df_3.index = [str(i) for i in (df_3.index.map('{0[0]} : {0[1]}'.format))]
df_3

Now that we did the previous modefications, we can easily plot the Ratio:

#Graph for probabilities

import matplotlib.pyplot as plt
plt.figure()

plt.plot(df_3.index, df_3['Ratio'])
plt.xlabel('Date')
plt.xticks(rotation=90)
plt.title('Probability of "on"=1')

Here's the graph:

enter image description here

  • 1
    thanks for your help. Your 2. option is perfect. can you also help me to create a diagram of the probabilities ( `df2['ratio']`) over a week (starting on Monday) from the 2nd option? Thanks a lot :) – Vini Sep 16 '20 at 13:39
  • You're welcome. I updated my answer to provide the graph you asked for. You can change the type of the graph as you want. Good Luck! – Yassine Majdoub Sep 16 '20 at 15:50
  • thank you very much - your answers are perfekt. Do you know, if its possible, that only the days @ 14:45:00 are shown in the plot x-axis? – Vini Sep 17 '20 at 11:02
  • You're welcome. I'm sorry, I'm not sure I know a simple way to do that. Except if you want to manualy manipulate the list of x-values. – Yassine Majdoub Sep 17 '20 at 11:54
  • No problem. I'm using now this `plt.xticks(np.arange(0, len(df_3.index), step=3), rotation = 90)` and thats enough for me! sorry for the next question, but do you also know, how to create a trendline for this plot? Thanks a lot!!! – Vini Sep 17 '20 at 12:19
  • I used this [answer](https://stackoverflow.com/questions/50474104/python-matplotlib-trend-line-with-string-x-axis-labels) to answer your question. This code works for me: `dates = [dateutil.parser.parse(i) for i in df_3.index] x = mdates.date2num(dates) z = np.polyfit(x, y, 1) p = np.poly1d(z) plt.plot(df_3.index, p(x),"r--")` – Yassine Majdoub Sep 17 '20 at 13:33
  • Thanks for the answer, i tried your code, but getting at the line `dates = ...` following error: `Unknown string format: Monday : 06:45:00` and don't know how to fix it. Can you help me? – Vini Sep 17 '20 at 15:57
  • Yes change this code line `df_3.index = [str(i) for i in (df_3.index.map('{0[0]} : {0[1]}'.format))]` into this `df_3.index = [str(i) for i in (df_3.index.map('{0[0]}, {0[1]}'.format))]` and it should work. – Yassine Majdoub Sep 17 '20 at 16:40
1

I believe you need mean only:

df1 = df.groupby('day_name', sort=False, as_index=False)['on'].mean()
print (df1)
    day_name        on
0   Thursday  0.500000
1     Friday  0.833333
2   Saturday  0.166667
3     Sunday  0.333333
4     Monday  0.333333
5    Tuesday  0.333333
6  Wednesday  0.666667
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252