0

This is my current coding for data wrangling.

crime_89 <- crime %>%
  filter(state_name %in% c("Georgia", "Pennsylvania", "West Virginia"), !is.na(rape_legacy)) %>%
  mutate(treat = case_when(year < 1989 ~ 0, 
                           TRUE ~ 1),
         id = case_when(state_name == "Georgia" ~ 1,
                        state_name == "Pennsylvania" ~ 2,
                        state_name == "West Virginia" ~ 3),
         crime_rate = 100 * (violent_crime + homicide + rape_legacy + robbery + aggravated_assault + property_crime + burglary + larceny + motor_vehicle_theft) / population )

I know that mutate from dplyr is the tool to create new columns.

The problem is that this crime89 is the narrow scope from the original data limited to specific states. I would like to use the original data crime that has all lists of the states. Then, id columns should be different; counting from Alaska as 1 ...... to Wyoming as 51. I don't know how to create id columns with this condition without using case_when typing all states individually.

Plus, I also need to create treat columns depending on the states. The yardstick is the year the law was launched and '1' means treated status. For example, "Georgia", "Pennsylvania", "West Virginia" are in 1989 so this is why I typed like this::

 mutate(treat = case_when(year < 1989 ~ 0, 
                           TRUE ~ 1),

However, not all of the states have the same year of the law. For example, "Idaho", "Mississippi", "Oregon" is 1990 so the code should be

  mutate(treat = case_when(year < 1990 ~ 0, 
                            TRUE ~ 1))

In the case covering all states, I don't know how to create new columns treat in mutate for different states at once.

This is the original data head(crime) and the year is from 1979 to 2016.

 year state_abbr state_name population violent_crime homicide rape_legacy rape_revised robbery aggravated_assault
1 1979                        220099000       1208030    21460       76390           NA  480700             629480
2 1979         AK     Alaska     406000          1994       54         292           NA     445               1203
3 1979         AL    Alabama    3769000         15578      496        1037           NA    4127               9918
4 1979         AR   Arkansas    2180000          7984      198         595           NA    1626               5565
5 1979         AZ    Arizona    2450000         14528      219        1120           NA    4305               8884
6 1979         CA California   22696000        184087     2952       12239           NA   75767              93129
  property_crime burglary larceny motor_vehicle_theft caveats
1       11041500  3327700 6601000             1112800        
2          23193     5616   15076                2501        
3         144372    48517   83791               12064        
4          70949    21457   45267                4225        
5         177977    48916  116976               12085        
6        1511021   496310  847148              167563        
SSW
  • 31
  • 3
  • 1
    It is advisable to ask one question at a time. Which question do you want to tackle first? – Onyambu May 25 '22 at 21:57
  • `id = as.integer(factor(state_name))` – Onyambu May 25 '22 at 21:57
  • I am sorry for asking two questions at the same time. I bolded two questions as you can see and I need both. – SSW May 25 '22 at 22:00
  • But I cannot tell what exactly the second question is asking probably you mean `case_when(state_name %in% c("Georgia", "Pennsylvania", "West Virginia") & year < 1989 ~0, state_name %in% c("Idaho", "Mississippi", "Oregon") & year < 1990~ 0, true~1)` THIS OF COURSE CAN BE REDUCED FURTHER – Onyambu May 25 '22 at 22:03
  • Originally, I separate data depending on the same year of the state. So I can apply ```case_when``` immediately because ```crime89``` indicates the states with the law in 1989 only. However, in the case of all states, not every state has the same year of 1989. So i cannot apply ```case_when``` with year 1989 only. I need to set all of the states with its year not 1989 only. I need to set the variable ```treat``` for each state depending on its year – SSW May 25 '22 at 22:08
  • Sorry I do not understand your statement. Consider the `id` part only – Onyambu May 25 '22 at 22:10
  • ```treat``` is the binary variable with 0 and 1. 1 means the law enacted periods. 0 is not. For example, for the variable of treat in Georgia, year from 1979 to 1988 should be 0 and year from 1989 to 2016 should be 1. However, for the variable of treat in Idaho, year from 1979 to 1989 should be 0 and year from 1990 to 2016 should be 1. As you can see, their enacted law year is different. I need to create ```treat``` columns with this yardstick depending on each state at once. – SSW May 25 '22 at 22:23
  • How do you know the enact year? ie how do you determine that Georgia is 1989 while Idaho is 1990? – Onyambu May 25 '22 at 22:27
  • For Alabama, the variable should be 1 for all years since enacted law already before 1979. – SSW May 25 '22 at 22:29
  • But how do you know when the law was enacted?? That is my question – Onyambu May 25 '22 at 22:30
  • The information on enacted year is based on the historical fact(data) but unfortunately, the original data ```crime``` did not include this information so I need to apply this information for each state in ```mutate``` if possible. – SSW May 25 '22 at 22:31
  • Consider creating a dataframe that has all the enacted years for all the states then you can join/merge the dataframe to this data on states and then just do a year comparison. – Onyambu May 25 '22 at 22:32
  • so your suggestion is creating ```NEWDATA``` and apply like this? ```mutate(treat = case_when(year < NEWDATA ~ 0, TRUE ~ 1)``` – SSW May 25 '22 at 22:38
  • No my suggestion is create a `df2 <- data.frame(state_name =c('Alabama', 'Georgia') , enacted_year = c(1979, 1989))`, then do `left_join(your_current_dataframe, df2) %>% mutate(treat = year >=enacted_year)` – Onyambu May 25 '22 at 22:41
  • ```"Georgia", "Pennsylvania", "West Virginia"``` share the same year. Do I have to put individually like ```c(1989, 1989, 1989)```? – SSW May 25 '22 at 22:44
  • Yes. All the states. – Onyambu May 25 '22 at 22:45

1 Answers1

0

Is this what you want to do?

I've reduced the sample data to only what's needed to address the two questions.

By mutating the state variables to factor you can get the order you want. Assuming alphabetical is okay, you can then create an ID column taking the integer of the factored state_name.

You could use the state_abbr in the case_when to reduce the typing, and I'm assuming treat is then determined by both year and matching the state?

library(tidyverse)

crime <- tribble(
  ~year, ~state_abbr, ~state_name,
  1979, "AK", "Alaska", 
  1979, "AL", "Alabama", 
  1979, "AR", "Arkansas", 
  1979, "AZ", "Arizona", 
  1979, "CA", "California", 
  1989, "GA", "Georgia", 
  1989, "PA", "Pennsylvania", 
  1989, "WV", "West Virginia",
  1990, "ID", "Idaho", 
  1990, "MS", "Mississippi", 
  1990, "OR", "Oregon"
)

crime %>%
  mutate(across(starts_with("state_"), factor),
    id = as.integer(state_name),
    treat = case_when(
      year == 1989 & state_abbr %in% c("GA", "PA", "WV") ~ 1,
      year == 1990 & state_abbr %in% c("ID", "MS", "OR") ~ 1,
      TRUE ~ 0
    )
  )
#> # A tibble: 11 × 5
#>     year state_abbr state_name       id treat
#>    <dbl> <fct>      <fct>         <int> <dbl>
#>  1  1979 AK         Alaska            2     0
#>  2  1979 AL         Alabama           1     0
#>  3  1979 AR         Arkansas          4     0
#>  4  1979 AZ         Arizona           3     0
#>  5  1979 CA         California        5     0
#>  6  1989 GA         Georgia           6     1
#>  7  1989 PA         Pennsylvania     10     1
#>  8  1989 WV         West Virginia    11     1
#>  9  1990 ID         Idaho             7     1
#> 10  1990 MS         Mississippi       8     1
#> 11  1990 OR         Oregon            9     1

Created on 2022-05-26 by the reprex package (v2.0.1)

Carl
  • 4,232
  • 2
  • 12
  • 24