I have a dataframe with two columns (age, date) giving a person's age and the date on which it was recorded. I want to approximate the date of birth from that data. My idea was to fit a linear model and take its intercept at age 0, but that does not work out of the box, and pandas no longer supports the ols() function.

import pandas as pd
import seaborn as sns
from pandas import Timestamp

age = [30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34]
date = [Timestamp('2001-02-10 00:01:00'),
 Timestamp('2001-11-12 00:01:00'),
 Timestamp('2002-02-27 00:01:00'),
 Timestamp('2002-07-05 00:01:00'),
 Timestamp('2002-07-20 00:01:00'),
 Timestamp('2002-08-15 00:01:00'),
 Timestamp('2002-09-08 00:01:00'),
 Timestamp('2002-10-15 00:01:00'),
 Timestamp('2002-12-21 00:01:00'),
 Timestamp('2003-04-04 00:01:00'),
 Timestamp('2003-07-29 00:01:00'),
 Timestamp('2003-08-11 00:01:00'),
 Timestamp('2004-02-28 00:01:00'),
 Timestamp('2005-01-11 00:01:00'),
 Timestamp('2005-01-12 00:01:00')]

df = pd.DataFrame({'age': age, 'date': date})

sns.regplot(df.age, df.date)

This throws an error:

TypeError: reduction operation 'mean' not allowed for this dtype

What is the best way to transform the data into something that can be fitted, transform the result back to dates, and estimate confidence intervals? Is there a package that handles pandas.Timestamp values out of the box, e.g. scikit-learn?

ALollz
Soerendip

1 Answer

Use pd.to_numeric to convert to unix time, in this case the number of nanoseconds since 1970-01-01.

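import pandas as pd
import seaborn as sns

# Convert the timestamps to integer nanoseconds since 1970-01-01
df['date'] = pd.to_numeric(df.date)
# Recent seaborn releases require x and y as keyword arguments
sns.regplot(x=df.age, y=df.date)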

You can then convert the numeric result back to a date with pd.to_datetime().
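A minimal sketch of that round trip. It assumes the column is stored at nanosecond resolution, which the explicit `astype("datetime64[ns]")` cast enforces; the variable names are illustrative and the two dates are taken from the question's data.

```python
import pandas as pd

# Two of the dates from the question, forced to nanosecond resolution
dates = pd.Series(pd.to_datetime(["2001-02-10 00:01:00", "2005-01-12 00:01:00"]))
dates = dates.astype("datetime64[ns]")

# Datetimes -> integer nanoseconds since 1970-01-01
as_ns = pd.to_numeric(dates)

# Integer nanoseconds -> datetimes again
restored = pd.to_datetime(as_ns, unit="ns")

print(as_ns.dtype)
print(restored.equals(dates))
```

Passing `unit="ns"` only spells out what `pd.to_datetime` already assumes for plain numbers; it makes the intent visible to the reader.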


Example: Here's a simple linear fit

import numpy as np
df['date'] = pd.to_numeric(df.date)
fit = np.polyfit(df.age, df.date, 1)

# Predicted date of birth (age 0) in unix time
np.polyval(fit, 0)
# 4.966460634146548e+16

# The same result transformed to a date
pd.to_datetime(np.polyval(fit, 0))
# Timestamp('1971-07-29 19:43:26.341465480')
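The question also asks for confidence intervals. One possible sketch, not part of the original answer, uses the `cov=True` option of `np.polyfit` to get a standard error for the intercept. The factor 1.96 is an assumption (normal approximation for a rough 95% interval); with only 15 points a t-quantile would be somewhat wider.

```python
import numpy as np
import pandas as pd

# Data from the question
age = np.array([30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34], dtype=float)
date = pd.Series(pd.to_datetime([
    "2001-02-10 00:01:00", "2001-11-12 00:01:00", "2002-02-27 00:01:00",
    "2002-07-05 00:01:00", "2002-07-20 00:01:00", "2002-08-15 00:01:00",
    "2002-09-08 00:01:00", "2002-10-15 00:01:00", "2002-12-21 00:01:00",
    "2003-04-04 00:01:00", "2003-07-29 00:01:00", "2003-08-11 00:01:00",
    "2004-02-28 00:01:00", "2005-01-11 00:01:00", "2005-01-12 00:01:00",
])).astype("datetime64[ns]")

# Nanoseconds since 1970-01-01, as floats for the fit
date_ns = date.astype("int64").to_numpy(dtype=float)

# cov=True additionally returns the covariance matrix of [slope, intercept]
coeffs, cov = np.polyfit(age, date_ns, 1, cov=True)
intercept_ns = coeffs[1]              # predicted date at age 0
intercept_se_ns = np.sqrt(cov[1, 1])  # standard error of the intercept

# Rough 95% interval under a normal approximation (assumption)
estimate = pd.Timestamp(int(intercept_ns))
low = pd.Timestamp(int(intercept_ns - 1.96 * intercept_se_ns))
high = pd.Timestamp(int(intercept_ns + 1.96 * intercept_se_ns))
print(estimate, low, high)
```

Because the intercept is an extrapolation from ages 30 to 34 all the way back to age 0, any uncertainty in the slope is magnified, so the interval should be expected to span years rather than days.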
ALollz
  • Though, people generally age 1 year per year, so the slope of the linear fit should be fixed to 1 year (converted to unix time). :-) – Soerendip Jun 22 '18 at 18:56
  • @Sören Good point. You'd want the intercept to be your one free parameter and the slope fixed to the number of nanoseconds in a year, so use `scipy.optimize` to fit your actual function. That was just a quick illustration. – ALollz Jun 22 '18 at 19:43
  • Totally. That's very easy to do. – Soerendip Jun 23 '18 at 01:03
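The fixed-slope idea from the comments can be sketched without an optimizer: when the slope is held fixed, the least-squares intercept is simply the mean of `date - age * year`. The helper name `estimate_birth_date` and the constant `NS_PER_YEAR` (a 365.25-day year) are assumptions made for illustration, not part of the thread.

```python
import numpy as np
import pandas as pd

# Assumption: one year = 365.25 days, expressed in nanoseconds
NS_PER_YEAR = 365.25 * 24 * 3600 * 1e9


def estimate_birth_date(ages, dates):
    """Hypothetical helper: fit date = birth + age * NS_PER_YEAR, slope fixed."""
    ages = np.asarray(ages, dtype=float)
    date_ns = (
        pd.Series(pd.to_datetime(dates))
        .astype("datetime64[ns]")
        .astype("int64")
        .to_numpy(dtype=float)
    )
    # Each observation implies one birth date; with a fixed slope the
    # least-squares intercept is the mean of these implied dates.
    implied_ns = date_ns - ages * NS_PER_YEAR
    birth_ns = implied_ns.mean()
    # Standard error of that mean, as a simple measure of uncertainty
    sem_ns = implied_ns.std(ddof=1) / np.sqrt(len(implied_ns))
    return pd.Timestamp(int(round(birth_ns))), pd.Timedelta(int(round(sem_ns)), unit="ns")


age = [30, 31, 31, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 34, 34]
date = [
    "2001-02-10 00:01:00", "2001-11-12 00:01:00", "2002-02-27 00:01:00",
    "2002-07-05 00:01:00", "2002-07-20 00:01:00", "2002-08-15 00:01:00",
    "2002-09-08 00:01:00", "2002-10-15 00:01:00", "2002-12-21 00:01:00",
    "2003-04-04 00:01:00", "2003-07-29 00:01:00", "2003-08-11 00:01:00",
    "2004-02-28 00:01:00", "2005-01-11 00:01:00", "2005-01-12 00:01:00",
]

birth, birth_sem = estimate_birth_date(age, date)
print(birth, birth_sem)
```

One caveat: reported ages are whole years, so each implied birth date is the latest one consistent with that observation, and the true date of birth lies up to a year earlier. Subtracting about half a year is one possible correction; it is not applied in this sketch.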