I'm trying to build predictive models for time series, but I run into problems when the same year appears multiple times in the dataset.

For context, I'm using this Kaggle dataset, which gives the life expectancy in each country for every year from 2000 to 2015. The dataset is clean: there are no NAs and no other data issues to deal with.

What I've done so far is:

library(tidyverse)
library(tidymodels)
library(readxl)
library(janitor)
library(timetk)     # time_series_split() and step_timeseries_signature()
library(modeltime)
library(lubridate)

dados <- read_csv("Life Expectancy Data.csv")

dados <- clean_names(dados)

# Drop economy_status_developing and treat the remaining indicator as a factor
dados <- subset(dados, select = -c(economy_status_developing))
dados$economy_status_developed <- as.factor(dados$economy_status_developed)
# Turn the numeric year into a Date (1 January of that year)
dados$year <- ymd(sprintf("%d-01-01", dados$year))

# This is the line that triggers the warning quoted below
splits <- time_series_split(dados, date_var = year)

recipe_spec <- recipe(life_expectancy ~ year + infant_deaths + under_five_deaths + adult_mortality + alcohol_consumption + hepatitis_b + measles + bmi + polio + diphtheria + incidents_hiv + gdp_per_capita + population_mln + thinness_ten_nineteen_years + thinness_five_nine_years + schooling + economy_status_developed, dados) %>%
  step_timeseries_signature(year) %>%
  step_dummy(all_nominal())

# Inspect the preprocessed data
recipe_spec %>% prep() %>% juice()

But this is the warning message I get when I run the splits line: "Data is not ordered by the 'date_var'. Resamples will be arranged by year. Overlapping Timestamps Detected. Processing overlapping time series together using sliding windows.".

I don't know how to make the model understand that the dataset contains multiple countries, which is why the same year appears multiple times. In fact, each year appears 179 times, because there are 179 different countries. How can I fix this?
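
The overlap that the warning refers to can be checked directly. The following is only a minimal sketch: it assumes the cleaned data has a `country` column (an assumption, since that column does not appear in the code above) and uses a small made-up tibble named `toy` instead of the real file.

```r
library(dplyr)

# Hypothetical stand-in for the real data: 3 countries x 4 years.
toy <- tibble(
  country = rep(c("A", "B", "C"), each = 4),
  year = rep(as.Date(c("2000-01-01", "2001-01-01",
                       "2002-01-01", "2003-01-01")), times = 3),
  life_expectancy = c(70, 71, 72, 73, 60, 61, 62, 63, 80, 80.5, 81, 81.5)
)

# Each date appears once per country, so every timestamp is repeated.
rows_per_year <- count(toy, year)
print(rows_per_year)

# The pair (country, year) should still identify each row uniquely;
# this keeps only the pairs that occur more than once.
duplicated_keys <- toy %>%
  count(country, year) %>%
  filter(n > 1)
print(nrow(duplicated_keys))
```

If `duplicated_keys` is empty, the repeated years come only from having one series per country, not from duplicated rows.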

  • It depends what you're trying to do but you may need to deal with the countries separately, group them into regions that make sense for your subject topic, or aggregate them all together. The latter option is probably not ideal if you have many countries. – NicChr Jun 17 '23 at 09:43
  • Another option is taking the average of the years (if that is appropriate for what you are trying to do). – Mark Jun 17 '23 at 11:06
  • So there's nothing like nested time series or something like that? – Paulo Luis Jun 17 '23 at 14:02
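
The options raised in the comments can be sketched with dplyr and tidyr. This is a rough illustration on made-up data, not a recommendation for any one of them: the `country` column, the `toy` tibble, and the 2002 cutoff are all assumptions chosen for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real data: 3 countries x 4 years.
toy <- tibble(
  country = rep(c("A", "B", "C"), each = 4),
  year = rep(as.Date(c("2000-01-01", "2001-01-01",
                       "2002-01-01", "2003-01-01")), times = 3),
  life_expectancy = c(70, 71, 72, 73, 60, 61, 62, 63, 80, 80.5, 81, 81.5)
)

# Option 1: aggregate all countries into a single yearly series.
yearly_mean <- toy %>%
  group_by(year) %>%
  summarise(life_expectancy = mean(life_expectancy), .groups = "drop")

# Option 2: keep one series per country in a nested data frame,
# so each country can be split and modelled on its own.
by_country <- nest(toy, data = -country)

# Option 3: keep the panel as it is and split on a date cutoff,
# so every country contributes to both the training and test sets.
cutoff <- as.Date("2002-01-01")
train <- filter(toy, year <= cutoff)
test <- filter(toy, year > cutoff)

print(yearly_mean)
print(by_country)
```

With the first option the per-country information is lost, while the second and third keep it; which one fits depends on whether the goal is one forecast overall or one per country.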
