
I would like to use the day of the year in a machine learning model. Because the day of the year is cyclic rather than continuous (day 365 of 2019 is followed by day 1 of 2020), I am thinking of applying a cyclic (sine or cosine) transformation, following this link.

However, within each year the transformed variable is not unique: for example, the sine transformation takes a value of about 0.5 twice in the same year (see figures below). I need to use the day of the year both in model training and in prediction. A sine value of 0.5 can correspond to either 31.01.2019 or 31.05.2019, so the value 0.5 on its own can be confusing for the model.

Is it possible to make the model differentiate between the two values of 0.5 within the same year?

I am modelling the distribution of a species using the Maxent software. The species data are recorded daily over 20 years. I need the model to capture the signal of the day or the season without using either of them explicitly as a categorical variable.

Thanks

[Figures: sine transformation; cosine transformation]

EDIT1 (in response to furcifer's comment below): I do not find the incremental modelling approach useful for my application. It solves the issue of a consistent difference between subsequent days, e.g. 30.12.2018, 31.12.2018, and 01.01.2019, but it is no different from counting the number of days from a certain reference day (weight = 1). Having much higher values on the same date in 2019 than in 2014 does not make ecological sense. I hope that interannual changes will be captured by the daily environmental conditions used as explanatory variables. I need the day in the model to capture the seasonal trend in the distribution of a migratory species, without the explicit use of month or season as a categorical variable. To predict suitable habitats for today, the prediction needs to depend not only on today's environmental conditions but also on the day of the year.

Ahmed El-Gabbas

2 Answers

1

This is a common problem, but I'm not sure there is a perfect solution. Note that there are two things you might want to model with your date variable:

  • Seasonal effects
  • Season-independent trends and autocorrelation

For seasonal effects, the cyclic transformation is sometimes used for linear models, but I don't see the point for ML models: with enough data, you would expect the model to learn a smooth connection at the year boundary anyway, so what's the problem? I think the posts you link to are a distraction, or at least they do not properly explain why and when a cyclic transformation is useful. I would just use the day of the year (dYear) to model the seasonal effect.
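For reference, here is a minimal sketch of the cyclic transformation discussed above, in case it is used after all. It assumes plain Python `datetime.date` inputs; the function name `cyclic_day_features` is made up for this illustration. The sine alone is nearly the same on the two dates from the question, but the cosine has opposite signs, so the (sine, cosine) pair identifies each day of the year.

```python
import math
from datetime import date

def cyclic_day_features(d):
    """Return (sin, cos) of the day-of-year angle for a datetime.date."""
    day_of_year = d.timetuple().tm_yday                       # 1..365 (366 in leap years)
    days_in_year = date(d.year, 12, 31).timetuple().tm_yday   # 365 or 366
    angle = 2 * math.pi * (day_of_year - 1) / days_in_year
    return math.sin(angle), math.cos(angle)

# The two dates from the question: similar sine, opposite-sign cosine.
s1, c1 = cyclic_day_features(date(2019, 1, 31))
s2, c2 = cyclic_day_features(date(2019, 5, 31))
print(s1, c1)
print(s2, c2)

# The year boundary is continuous in (sin, cos) space.
print(cyclic_day_features(date(2019, 12, 31)))
print(cyclic_day_features(date(2020, 1, 1)))
```

Whether a given model benefits from two columns instead of a single dYear column is an empirical question; this only shows that the pair removes the ambiguity.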

However, the discontinuity might be a problem for modelling trends, autocorrelation, or other variation in the time series that is not seasonal or not common between years. For that reason, I would also add an absolute date (dAbsolute) to the model, i.e. use

y = dYear + dAbsolute + otherPredictors

A well-tuned ML model should be able to do the rest, with the usual caveats, and if you have enough data.
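As a rough sketch of how the two date predictors could be derived, assuming `datetime.date` inputs: `REFERENCE_DAY` and `date_features` are hypothetical names chosen for this example, and any fixed reference date would do.

```python
from datetime import date

# Hypothetical reference date; only differences between dates matter.
REFERENCE_DAY = date(2000, 1, 1)

def date_features(d):
    """Return (dYear, dAbsolute) for a datetime.date."""
    d_year = d.timetuple().tm_yday            # day of the year, restarts every January
    d_absolute = (d - REFERENCE_DAY).days     # days since the reference, never restarts
    return d_year, d_absolute

last_day = date_features(date(2019, 12, 31))
first_day = date_features(date(2020, 1, 1))
print(last_day, first_day)
```

Across the year boundary dYear jumps from 365 back to 1, while dAbsolute increases by exactly one day.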

0

This may not be the right choice depending on your needs, but two options come to mind.

  1. Incremental modeling

In this case, the dates are modeled in a linear fashion, so that, say, 12 Dec 2018 < 12 Dec 2019.

For this, you just need a transformation function that converts dates to numeric values.

As there are many dates that need to be converted to a numeric representation, the first thing to make sure of is that the output list preserves the same order, as Lukas mentioned. The easiest way to do this is to give each unit a weight (weight_year > weight_month > weight_day).

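def date2num(date_time):
  # Expects a 'DD-MM-YYYY' string, e.g. '31-12-2018'.
  d, m, y = date_time.split('-')
  # Each weight must exceed the largest possible total of the smaller units
  # (day <= 31, month <= 12). With weights 10/100/1000, 31-01-2019 would
  # map above 01-02-2019 and the ordering would break.
  num = int(y)*10000 + int(m)*100 + int(d)
  return num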

Now, it's important to normalize the numeric values.

import numpy as np
# df is assumed to be a pandas DataFrame with a 'date_time' column of date strings
date_features = np.array([date2num(d) for d in df['date_time']])
# min-max scaling to the range [0, 1]
f_min, f_max = date_features.min(), date_features.max()
date_features_normalized = (date_features - f_min) / (f_max - f_min)
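As an alternative sketch that avoids hand-picked weights, the standard library's `toordinal()` returns a day count, so consecutive dates always differ by exactly 1. The 'DD-MM-YYYY' input format and the helper names `date_to_ordinal` and `min_max_normalize` are assumptions made for this example.

```python
from datetime import datetime

def date_to_ordinal(date_string):
    """Convert a 'DD-MM-YYYY' string to a day count (proleptic Gregorian ordinal)."""
    return datetime.strptime(date_string, "%d-%m-%Y").toordinal()

def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # avoid division by zero for constant input
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

dates = ["30-12-2018", "31-12-2018", "01-01-2019"]
ordinals = [date_to_ordinal(s) for s in dates]
normalized = min_max_normalize(ordinals)
print(normalized)  # [0.0, 0.5, 1.0]
```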
  2. Using the day, month, and year as separate features

So, instead of considering the date as a whole, we split it into its components. The motivation is that there may be a relation between the output and a specific day, month, etc. For example, the output may suddenly increase in the summer season (specific months) or on weekends (specific days).
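A minimal sketch of this split, again assuming 'DD-MM-YYYY' strings; `split_date` is a hypothetical helper. The weekday column is an addition not listed above: it is included because a weekend effect depends on the day of the week rather than the day of the month.

```python
from datetime import datetime

def split_date(date_string):
    """Split a 'DD-MM-YYYY' string into separate numeric features."""
    dt = datetime.strptime(date_string, "%d-%m-%Y")
    return {
        "day": dt.day,
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.weekday(),   # Monday = 0 ... Sunday = 6
    }

features = split_date("31-01-2019")
print(features)
```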
Zabir Al Nazi