
I am currently writing a "Split - Apply - Combine" pipeline for my data analysis, which also involves dates. Here's some sample data:

In [1]:
import pandas as pd
import numpy as np
import datetime as dt

startdate = np.datetime64("2018-01-01")
randdates = np.random.randint(1, 365, 100) + startdate

df = pd.DataFrame({'Type': np.random.choice(['A', 'B', 'C'], 100), 
                   'Metric': np.random.rand(100), 
                   'Date': randdates})
df.head()

Out[1]: 
  Type    Metric       Date
0    A  0.442970 2018-08-02
1    A  0.611648 2018-02-11
2    B  0.202763 2018-03-16
3    A  0.295577 2018-01-09
4    A  0.895391 2018-11-11

Now I want to aggregate by 'Type' and get summary statistics for the respective variables. This is easy for numerical variables like 'Metric':

df.groupby('Type')['Metric'].agg(('mean', 'std'))

For datetime objects, however, calculating a mean, standard deviation, or other statistics doesn't really make sense and throws an error. The context I need this operation for is that I am modeling a date based on some distance metric. When I repeat this modeling with random sampling (Monte Carlo simulation), I later want to assign a mean and confidence interval to the modeled dates.

So my question is: what useful statistics can be built from datetime data? How do you represent the statistical distribution of modeled dates? And how do you implement the aggregation operation?

My ideal output would be Date_mean and Date_stdev columns representing a range for my modeled dates.

flurble
  • You could consider using timedeltas. Intuitively, "a mean of 5 days +/- a standard deviation of half a day" sounds more reasonable than "a mean of the 5th of May, 2019". Seeing that time is not really a dependent variable in your case, what do you hope your summary statistics will tell you? (See the sketch after these comments.) – warped May 15 '19 at 10:28
  • That’s true. I am reconstructing calendar dates based on a distance measurement. Imagine tree rings, which are wider or narrower depending on how fast the tree grows. In the end I want to say something like "a growth of 8 mm corresponds to 24 +- 2 days", which could be possible with timedeltas. But given that growth is faster in summer and slower in winter, I might need to say something like "since 10.07.2018 it has grown 18 mm, which corresponds to 176 +-12 days and equals 11.04.2019". – flurble May 15 '19 at 10:45
  • So, time *is the dependent* variable here, because I modeled it in a previous step based on growth distance (`metric`), some chemical analyses, a prediction and dtw... Another way of putting the statistics I want would be "On the 3rd of April, specimens grow 0.4 mm +- 0.001 mm", or "the specimen grew 4 mm since the last measurement at 04.04.2018, so at this measurement it must be 08.05.2018 +- 3 days". I reconstructed the growth/date correlation in that previous step, so growth/distance is the known and date/time the unknown. – flurble May 15 '19 at 11:09
  • Please don't write answers in comments :) – prosti May 15 '19 at 11:39
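
A rough sketch of the timedelta idea from the comments (untested; the anchor date ref is an arbitrary choice, and the Date_mean / Date_stdev column names just follow the wording above): each date becomes a numeric offset from a fixed anchor, the offsets are aggregated like any other number, and the mean is mapped back to a calendar date.

# Sketch of the timedelta idea: dates as day-offsets from an arbitrary anchor.
ref = pd.Timestamp('2018-01-01')                         # arbitrary anchor date

days = (df['Date'] - ref).dt.total_seconds() / 86400     # offsets in days (float)
stats = days.groupby(df['Type']).agg(['mean', 'std'])

stats['Date_mean'] = ref + pd.to_timedelta(stats['mean'], unit='D')   # back to a calendar date
stats['Date_stdev'] = pd.to_timedelta(stats['std'], unit='D')         # spread as "+/- x days"
stats[['Date_mean', 'Date_stdev']]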

2 Answers


You can use Unix timestamps

Epoch time, also known as a Unix timestamp, is the number of seconds (not milliseconds!) that have elapsed since January 1, 1970 at 00:00:00 GMT (1970-01-01 00:00:00 GMT).

You can convert all your dates to timestamps like this:

import time
import datetime
d = "2018-08-02"
# Note: mktime() interprets the time tuple in the *local* timezone,
# so the exact number (1533160800 here) depends on where the code runs.
time.mktime(datetime.datetime.strptime(d, "%Y-%m-%d").timetuple())  # 1533160800

And from there you can calculate what you need.
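
For example, a rough, untested sketch applying the same idea to the df from the question (the Date_mean / Date_stdev names follow the question's wording): convert the dates to seconds since the epoch, aggregate the numbers per 'Type', then convert the mean back with pd.to_datetime.

# Sketch: dates as seconds since the epoch (1970-01-01), aggregated per 'Type',
# then the mean converted back to a date and the spread to a duration.
secs = (df['Date'] - pd.Timestamp('1970-01-01')).dt.total_seconds()

agg = secs.groupby(df['Type']).agg(['mean', 'std'])
agg['Date_mean'] = pd.to_datetime(agg['mean'], unit='s')
agg['Date_stdev'] = pd.to_timedelta(agg['std'], unit='s')
agg[['Date_mean', 'Date_stdev']]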

prosti

You can compute min, max, and mean using the built-in datetime operations:

    date = dt.datetime.date
    df.groupby('Type')['Date'].agg(lambda x:(date(x.mean()), date(x.min()), date(x.max())))

    Out[490]:
    Type
    A    (2018-06-10, 2018-01-11, 2018-11-08)
    B    (2018-05-20, 2018-01-20, 2018-12-31)
    C    (2018-06-22, 2018-01-04, 2018-12-05)
    Name: Date, dtype: object

I used date(x) only to make the output fit here; it's not really needed.
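
If you also want a spread (the Date_stdev from the question), one possible extension of the above (a sketch, untested) is to subtract each group's mean date and take the standard deviation of the resulting timedeltas:

    # Sketch: spread as the std of the timedeltas around each group's mean date.
    g = df.groupby('Type')['Date']
    pd.DataFrame({
        'Date_mean':  g.apply(lambda x: x.mean()),
        'Date_stdev': g.apply(lambda x: (x - x.mean()).std()),
    })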

David Makovoz