0

I have a more or less specific question that probably pertains to loops in R. I have a dataframe:

X       location                                     year         
1      North Dakota, Minnesota, Michigan             2011
2      California, Tennessee                         2012
3      Bastrop County (Texas)                        2013
4      Dallas (Texas)                                2014
5      Shasta (California)                           2015
6      California, Oregon, Washington                2011

I have two problems with this data: 1) I need a column that consists of just the state names of each row. I guess this should be generally easy with gsub and using a list of all US state names.

list <- c("Alabama", "Alaska",  "Arizona", "Arkansas", "California", "etc")


pat <- paste0("\\b(", paste0(list, collapse="|"), ")\\b")    
pat

data$state <- gsub(data$location, "", paragraph)

The bigger issue for me is 2) I need an individual (duplicate) row for each state that is in the dataset. So if row 6 has California, Oregon and Washington in 2011, I need to have a separate row of each one separately like this:

X       location                                     year         
1      California                                    2011
2      Oregon                                        2011
3      Washington                                    2011

Thank you for your help!

1 Answers1

2

You can use str_extract_all to extract all the states and unnest to duplicate rows such that each state is in a separate row. There is an inbuilt constant state.name which have the state names of US which can be used here to create pattern.

library(dplyr)
pat <- paste0("\\b", state.name, "\\b", collapse = "|")    

df %>%
  mutate(states = stringr::str_extract_all(location, pat)) %>% 
  tidyr::unnest(states)


# A tibble: 11 x 3
#   location                           year states      
#   <chr>                             <int> <chr>       
# 1 North Dakota, Minnesota, Michigan  2011 North Dakota
# 2 North Dakota, Minnesota, Michigan  2011 Minnesota   
# 3 North Dakota, Minnesota, Michigan  2011 Michigan    
# 4 California, Tennessee              2012 California  
# 5 California, Tennessee              2012 Tennessee   
# 6 Bastrop County (Texas)             2013 Texas       
# 7 Dallas (Texas)                     2014 Texas       
# 8 Shasta (California)                2015 California  
# 9 California, Oregon, Washington     2011 California  
#10 California, Oregon, Washington     2011 Oregon      
#11 California, Oregon, Washington     2011 Washington  

data

df <- structure(list(location = c("North Dakota, Minnesota, Michigan", 
"California, Tennessee", "Bastrop County (Texas)", "Dallas (Texas)", 
"Shasta (California)", "California, Oregon, Washington"), year = c(2011L, 
2012L, 2013L, 2014L, 2015L, 2011L)), class = "data.frame", row.names = c(NA, -6L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213