3

I'm having the particular case in R where two categories are displayed in a yearly panel data but do not correspond to the level I am using, this is, there are two cities that belong to the same state (A) but are on their individual city levels (a and b) and the rest of the data is on State level (Dep).

year Dep Homicides
2000 C 20
2000 a 5
2000 b 3
2000 D 17
2000 E 18
2001 C 21
2001 a 6
2001 b 4
2001 D 18
2001 E 18

So I want to form a new category by sum using Dep == "a" and Dep =="b", that is, to sum my two cities that belongs to Dep A (also this is not exist, so it needs to be created) by each year, where the final output looks like:

year Dep Homicides
2000 C 20
2000 A 8
2000 D 17
2000 E 18
2001 C 21
2001 A 10
2001 D 18
2001 E 18

Thank you in advance!

2 Answers2

1

We could recode the 'Dep' levels to 'A' and use as grouping variable to get the sum

library(forcats)
library(data.table)
dt1[, .(Homicides = sum(Homicides)), 
   .(year, Dep = fct_other(Dep, drop = c("a", "b"), other_level = "A"))]

-output

    year    Dep Homicides
    <int> <fctr>     <int>
1:  2000      C        20
2:  2000      A         8
3:  2000      D        17
4:  2000      E        18
5:  2001      C        21
6:  2001      A        10
7:  2001      D        18
8:  2001      E        18

Or could use fct_collapse as well

dt1[, .(Homicides = sum(Homicides)), .(year, 
       Dep = fct_collapse(Dep, A = c("a", "b")))]

-output

   year    Dep Homicides
   <int> <fctr>     <int>
1:  2000      C        20
2:  2000      A         8
3:  2000      D        17
4:  2000      E        18
5:  2001      C        21
6:  2001      A        10
7:  2001      D        18
8:  2001      E        18

data

dt1 <- structure(list(year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 
2001L, 2001L, 2001L, 2001L), Dep = c("C", "a", "b", "D", "E", 
"C", "a", "b", "D", "E"), Homicides = c(20L, 5L, 3L, 17L, 18L, 
21L, 6L, 4L, 18L, 18L)), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you very much for this, it is really useful, however, when I try it on the multivariate case It doesn't work any longer, that is, if the panel contains Homicides + other variables. Any sugestions? – John M. Riveros Sep 14 '22 at 19:26
  • @JohnM.Riveros please update your post with a new example/expected output so that it is clear – akrun Sep 14 '22 at 19:48
1

Another data.table option is using replace

> dt[,lapply(.SD, sum), .(year, Dep = replace(Dep, Dep %in% c("a","b"), "A"))]
   year Dep Homicides
1: 2000   C        20
2: 2000   A         8
3: 2000   D        17
4: 2000   E        18
5: 2001   C        21
6: 2001   A        10
7: 2001   D        18
8: 2001   E        18
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81