1

I am working on my thesis with little knowledge of r, so the answer this question may be pretty obvious.

I have the a dataset looking like this:

county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)

For every county, I want to sum the value of employment of rows with naics '423620' and '423720'. (So two conditions: 1. same county code 2. those two naics codes) The row in which they are added should be the first one ('423620'), and the second one ('423720') should be removed

The final dataset should look like this:

county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)

I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.

Julia B
  • 13
  • 2

2 Answers2

1

We can do

library(dplyr)
data$naics <- as.character(data$naics)

data %>%
    filter(naics %in% c(423620, 423720)) %>% group_by(county) %>% 
    summarise(naics = "423620", employment = sum(employment)) %>%
    bind_rows(., filter(data, !naics  %in% c(423620, 423720)))
# A tibble: 5 x 3
#   county  naics employment
#  <fctr>  <chr>      <dbl>
#1   1001 423620         10
#2   1202 423620         10
#3   1303 423620          5
#4   1001 423630          6
#5   1303 423550          6
akrun
  • 874,273
  • 37
  • 540
  • 662
  • May I ask, why do we use "423720" in: **filter(naics %in% c(423620, 423720)) %>% group_by(county) %>%**, when we in the next step only summarize over **"naics = "423620"**? – Helen Jun 05 '18 at 09:30
  • 1
    @Erosennin In the `summarise` step, we are creating that 'naics' as 423620. I think the reason could be that the OP wanted the 'employment' sum where 'naics' is either of the two. Also, if you notice, here we are changing the two levels to a single one. I hope this helps – akrun Jun 05 '18 at 15:06
0

With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:

# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
  if( ("423620" %in% v) & ("423720" %in% v) ) v[v == "423720"] <- "423620"
  v
}

data %>% 
  dplyr::group_by(county) %>% 
  dplyr::mutate(naics = onlyThoseNAICS(naics)) %>% 
  dplyr::group_by(county, naics) %>% 
  dplyr::summarise(employment = sum(employment)) %>% 
  dplyr::ungroup()
Murray Bozinsky
  • 499
  • 4
  • 10