0

I have a big data.frame where I can find a municipality id for each person(id) over time. Sometimes they move and there will be a new municipality_id from the next day on. However, sometimes the municipality_id stays the same. I would like to collapse those date intervals by each id if there is no real change in municipality_id

This data


 id municipality_id       from         to
1  A             820 2007-01-01 2007-02-28
2  A             200 2007-03-01 2100-01-01
3  B             820 2007-01-01 2007-03-31
4  B             820 2007-04-01 2007-05-31
5  B             830 2007-06-01 2008-01-31
6  B             830 2008-02-01 2100-01-01
7  C             700 2007-01-01 2007-05-31
8  C             500 2007-06-01 2008-12-31
9  C             700 2009-01-01 2100-01-01

should turn into this (two observations for B were extended)


 id municipality_id       from         to
1  A             820 2007-01-01 2007-02-28
2  A             200 2007-03-01 2100-01-01
3  B             820 2007-01-01 2007-05-31
4  B             830 2007-06-01 2100-01-01
5  C             700 2007-01-01 2007-05-31
6  C             500 2007-06-01 2008-12-31
7  C             700 2009-01-01 2100-01-01

Here the code to create my two tables:

data <- data.frame(id = c('A', 'A', 'B', 'B','B', 'B', 'C', 'C', 'C'),
                   municipality_id = c(820, 200, 820, 820, 830, 830, 700, 500, 700),
                   from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-04-01", "2007-06-01", "2008-02-01", "2007-01-01", "2007-06-01", "2009-01-01")),
                   to = as.Date(c("2007-02-28", "2100-01-01", "2007-03-31", "2007-05-31", "2008-01-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01"))) 

Should turn into:

data_edit <- data.frame(id = c('A', 'A', 'B', 'B', 'C', 'C', 'C'),
                   municipality_id = c(820, 200, 820, 830, 700, 500, 700),
                   from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-06-01", "2007-01-01", "2007-06-01", "2009-01-01")),
                   to = as.Date(c("2007-02-28", "2100-01-01", "2007-05-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01"))) 

Is there an easy solution with R? Thank you for helping me :)

econ9595
  • 5
  • 2

1 Answers1

0

How about this:

data <- data.frame(id = c('A', 'A', 'B', 'B','B', 'B', 'C', 'C', 'C'),
                   municipality_id = c(820, 200, 820, 820, 830, 830, 700, 500, 700),
                   from = as.Date(c("2007-01-01", "2007-03-01", "2007-01-01", "2007-04-01", "2007-06-01", "2008-02-01", "2007-01-01", "2007-06-01", "2009-01-01")),
                   to = as.Date(c("2007-02-28", "2100-01-01", "2007-03-31", "2007-05-31", "2008-01-31", "2100-01-01", "2007-05-31", "2008-12-31", "2100-01-01"))) 

library(tidyr)
library(dplyr)
data %>% 
  arrange(id, from) %>% 
  group_by(id) %>% 
  mutate(diff_mid = ifelse(municipality_id == lag(municipality_id), 0, 1), 
         diff_mid = ifelse(is.na(diff_mid), 0, diff_mid), 
         gp = cumsum(diff_mid)) %>%
  pivot_longer(from:to, names_to="frto", values_to="vals") %>% 
  group_by(id, municipality_id, gp) %>% 
  summarise(from = min(vals), to=max(vals)) %>% 
  ungroup %>% 
  arrange(id, from) %>% 
  ungroup %>% 
  select(-gp)
#> `summarise()` has grouped output by 'id', 'municipality_id'. You can override
#> using the `.groups` argument.
#> # A tibble: 7 × 4
#>   id    municipality_id from       to        
#>   <chr>           <dbl> <date>     <date>    
#> 1 A                 820 2007-01-01 2007-02-28
#> 2 A                 200 2007-03-01 2100-01-01
#> 3 B                 820 2007-01-01 2007-05-31
#> 4 B                 830 2007-06-01 2100-01-01
#> 5 C                 700 2007-01-01 2007-05-31
#> 6 C                 500 2007-06-01 2008-12-31
#> 7 C                 700 2009-01-01 2100-01-01

Created on 2022-12-06 by the reprex package (v2.0.1)

DaveArmstrong
  • 18,377
  • 2
  • 13
  • 25
  • Hi Dave, thanks a lot for your comment. Unfortunately, this doesn't solve my problem since this doesn't consider that a person might live in another municipality for some time and then come back to the original municipality. See for this reason id "C". With your code it just collapses all observations for municipality_id but it should consider those as two different spells, still showing three spells for C. – econ9595 Dec 06 '22 at 15:41
  • @econ9595 Apologies, I updated the answer to solve the problem. – DaveArmstrong Dec 06 '22 at 17:04