I am currently writing a "Split-Apply-Combine" pipeline for my data analysis, which also involves dates. Here's some sample data:
In [1]:
import pandas as pd
import numpy as np
import datetime as dt
startdate = np.datetime64("2018-01-01")
randdates = np.random.randint(1, 365, 100) + startdate
df = pd.DataFrame({'Type': np.random.choice(['A', 'B', 'C'], 100),
                   'Metric': np.random.rand(100),
                   'Date': randdates})
df.head()
Out[1]:
Type Metric Date
0 A 0.442970 2018-08-02
1 A 0.611648 2018-02-11
2 B 0.202763 2018-03-16
3 A 0.295577 2018-01-09
4 A 0.895391 2018-11-11
Now I want to aggregate by 'Type' and get summary statistics for the respective variables. This is easy for numerical variables like 'Metric':
df.groupby('Type')['Metric'].agg(['mean', 'std'])
For datetime objects, however, calculating a mean, standard deviation, or other statistics doesn't really make sense and throws an error. The context I need this operation for is that I am modelling a Date based on some distance metric. When I repeat this modelling with random sampling (Monte Carlo simulation), I later want to reassign a mean and confidence interval to the modelled dates, roughly along the lines of the sketch below.
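For illustration, this is how I currently post-process a single set of simulated dates (using the Date column above as a stand-in for the simulated dates; the detour via integer nanoseconds is just my current workaround, not necessarily the right approach):

# Workaround so far: treat the dates as nanoseconds since the epoch,
# compute the statistics on the integers, then convert back.
sim_dates = df['Date']                                       # stand-in for one set of simulated dates
as_ns = sim_dates.astype('int64')                            # datetime64[ns] -> integer nanoseconds
mean_date = pd.to_datetime(int(as_ns.mean()))                # "mean" date
ci_lower = pd.to_datetime(int(np.percentile(as_ns, 2.5)))    # lower bound of a 95% interval
ci_upper = pd.to_datetime(int(np.percentile(as_ns, 97.5)))   # upper bound of a 95% interval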
So my question is: What useful statistics can be built with datetime data? How do you represent the statistical distribution of modelled dates? And how do you implement the aggregation operation?
My ideal output would be Date_mean and Date_stdev columns representing a range for my modelled dates; a rough attempt at this is shown below.
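For completeness, the direction I have been experimenting with for the grouped aggregation looks like this (again via integer nanoseconds; Date_mean and Date_stdev are just the column names I would like to end up with):

# Current attempt: aggregate the integer representation per 'Type',
# then convert the results back to a timestamp and a timedelta.
date_ns = df['Date'].astype('int64')                              # ns since epoch
stats = date_ns.groupby(df['Type']).agg(['mean', 'std'])
result = pd.DataFrame({
    'Date_mean': pd.to_datetime(stats['mean'].astype('int64')),   # mean as a timestamp
    'Date_stdev': pd.to_timedelta(stats['std'])                    # spread as a timedelta
})

This gives me something usable, but I am unsure whether going through int64 is the recommended way, or whether there is a more direct and statistically sound representation.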