3

I am working with some time series data, where each row is an observation of a person, and I have two time periods, the start date and the end date. I am trying to create dummy variables for each year, such that if the year falls between the start date and the end date, the dummy is coded as 1.

The end result is to use this for data visualization purposes on demographics by year.

I've looked at some packages, but it seems to create dummies from variables already provided. Since some of the years may be missing from one of the columns, I'm trying to find an alternate option.

id <- c(1:3)
start.date <- c(1990, 1850, 1910)
end.date <- c(2014, 1920, 1980)

df <- data.frame(id, start.date, end.date)

df

As you can see from the structure of the data, I would like individual 1, for instance, to have the dummies coded between 1990 and 2014 as 1, and 0 otherwise.

M--
  • 25,431
  • 8
  • 61
  • 93
Ryan
  • 77
  • 7
  • Can you provide a sample from your real data? – maydin Aug 13 '19 at 21:28
  • The dataframe above is extracted from the actual data. – Ryan Aug 13 '19 at 21:32
  • In real data, you need to have some varying date column. I can't see in here. So I don't know the starting point of your time series. – maydin Aug 13 '19 at 21:33
  • The starting point of the time series is the earliest start.date (which in this case is 1850). – Ryan Aug 13 '19 at 21:36
  • I think it is a panel data set rather than a time series data. Because you have individual column as well. So you have observations starting from 1850 to 2014 for each individual? Which makes it 165 * unique(id) long, right? – maydin Aug 13 '19 at 21:43

3 Answers3

2

Another method of doing this using tidyr::expand instead of dplyr::transmute.

df1 <- data.frame(id = c(1:3),
                  start.date = c(1990, 1850, 1910),
                  end.date = c(2014, 1920, 1980))
library(dplyr)
library(tidyr)
df1 %>%
  group_by(id) %>% 
  expand(year = start.date:end.date) %>% 
  mutate(value = 1, year = paste0("Y",year)) %>% 
  ungroup %>% 
  spread(year, value, fill = 0)

#> # A tibble: 3 x 157
#>      id Y1850 Y1851 Y1852 Y1853 Y1854 Y1855 Y1856 Y1857 Y1858 Y1859 Y1860
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     0     0     0     0     0     0     0     0     0     0     0
#> 2     2     1     1     1     1     1     1     1     1     1     1     1
#> 3     3     0     0     0     0     0     0     0     0     0     0     0
#> # ... with 145 more variables: Y1861 <dbl>, Y1862 <dbl>, Y1863 <dbl>,...
M--
  • 25,431
  • 8
  • 61
  • 93
1

If I understand correctly, you want a dataframe with all years for every id -

library(dplyr)
library(tidyr)

df %>% 
  group_by(id) %>% 
  transmute(years = list(paste0("Y", start.date:end.date)), value = 1) %>% 
  unnest() %>% 
  ungroup() %>% 
  spread(years, value, fill = 0)

# showing first 10 of total 157 columns
# A tibble: 3 x 10
     id Y1850 Y1851 Y1852 Y1853 Y1854 Y1855 Y1856 Y1857 Y1858
  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     0     0     0     0     0     0     0     0     0
2     2     1     1     1     1     1     1     1     1     1
3     3     0     0     0     0     0     0     0     0     0
Shree
  • 10,835
  • 1
  • 14
  • 36
  • This is basically correct, except I want it in the wide format, with column names being the year (something like exist_1850, and then a 1 or 0. There are some other variables (that I haven't included for ease of intuition) that I'd rather not reshape if possible. – Ryan Aug 13 '19 at 21:35
  • Here's a follow up question, which is totally left field but I thought I might as well ask. So the whole purpose of this expansion was to start visualizing the data by year. For instance, take a look at the % Female per year using a plot in ggplot2. Since each row is an observation, is this the best way to approach that end goal? I was thinking that if I somehow filtered observations when Y1850==1 (for instance), I could then find the % female and then plot. But obviously I have quite a few years. – Ryan Aug 29 '19 at 21:48
0

Here's a base way to do it:

seqs <- df$end.date - df$start.date+1

table(data.frame(id = rep(df[['id']], seqs)
           , Y = paste0('Y', rep(df[['start.date']], seqs) + sequence(seqs) - 1)))

   Y
id  Y1850 Y1851 Y1852 Y1853 Y1854 Y1855 Y1856 Y1857 Y1858 Y1859 ... lots of columns
  1     0     0     0     0     0     0     0     0     0     0
  2     1     1     1     1     1     1     1     1     1     1
  3     0     0     0     0     0     0     0     0     0     0
Cole
  • 11,130
  • 1
  • 9
  • 24