When I run the following code, the dissimilarity estimate (D) is the same for every county (0.648). I'm wondering whether this is caused by a lack of geometry information, since I created the COUNTY_FIPS variable myself. I'm looking for suggestions on how to fix this code or do the operation differently. The goal is a dissimilarity index for every county in the U.S.; I ran a batch of about half the states first to reduce the data size and run time. (Beginner/Intermediate User)

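library(tidycensus)   # get_acs()
library(dplyr)        # %>%, mutate(), filter(), group_by(), group_modify()
library(segregation)  # dissimilarity()

my_states <- c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "HI",
               "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI")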

# Second batch of states, commented out for now:
# my_states2 <- c("MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY",
#                 "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX",
#                 "UT", "VT", "VA", "WA", "WV", "WI", "WY")

acs_data1 <- get_acs(
  geography = "tract",
  variables = c(
    white = "B03002_003",
    black = "B03002_004",
    asian = "B03002_006",
    hispanic = "B03002_012"), 
  state = my_states,
  geometry = TRUE,
  year = 2019
) 

seg_acs_data <- acs_data1 %>% 
  mutate(COUNTY_FIPS = substr(GEOID, 1, 5))

subsetseg <- seg_acs_data %>% filter(variable %in% c("white", "black"))
  
dissimilarity <- subsetseg %>% group_by(COUNTY_FIPS) %>%
  group_modify(~
                 dissimilarity(data = subsetseg,
                               group = "variable",
                               unit = "GEOID",
                               weight = "estimate"
                 )) 

Without saving the result as an object ("dissimilarity"), the output is:

# A tibble: 1,314 x 3
# Groups:   COUNTY_FIPS [1,314]
   COUNTY_FIPS stat    est
   <chr>       <chr> <dbl>
 1 01001       D     0.648
 2 01003       D     0.648
 3 01005       D     0.648
 4 01007       D     0.648
 5 01009       D     0.648
 6 01011       D     0.648
 7 01013       D     0.648
 8 01015       D     0.648
 9 01017       D     0.648
10 01019       D     0.648
# ... with 1,304 more rows
Geraldine

1 Answer

If you look at the relevant section of your code here:

             dissimilarity(data = subsetseg,
                           group = "variable",
                           unit = "GEOID",
                           weight = "estimate"
             )

you'll notice that you are passing the entire dataset subsetseg to dissimilarity() for every group, which is why you get the same result for each county. Because you are using formula notation (~) inside group_modify(), the data argument should be .x, which refers to the rows of the current group only:

             dissimilarity(data = .x,
                           group = "variable",
                           unit = "GEOID",
                           weight = "estimate"
             )

However, this operation will fail as some counties in the US are single-tract counties for which dissimilarity cannot be calculated. So you'll want to refine your approach a bit.
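
One possible refinement, as a minimal sketch rather than a definitive fix: drop counties with fewer than two tracts before calling dissimilarity() on .x. The toy_seg data below is made up for illustration (it is not real ACS output and has no geometry column), and county_d is just a name I picked. The n_distinct(GEOID) > 1 filter is my assumption about what "refining" could look like; other approaches would work too.

```r
library(dplyr)
library(segregation)

# Toy stand-in for subsetseg (made-up numbers, not real ACS estimates).
# County 01001 has two tracts; county 01003 has only one.
toy_seg <- tibble(
  GEOID = c("01001020100", "01001020100", "01001020200", "01001020200",
            "01003010100", "01003010100"),
  variable = c("white", "black", "white", "black", "white", "black"),
  estimate = c(80, 20, 20, 80, 50, 50)
) %>%
  mutate(COUNTY_FIPS = substr(GEOID, 1, 5))

county_d <- toy_seg %>%
  group_by(COUNTY_FIPS) %>%
  # Keep only counties that contain more than one tract
  filter(n_distinct(GEOID) > 1) %>%
  # .x holds the rows of the current county, not the full data set
  group_modify(~ dissimilarity(
    data = .x,
    group = "variable",
    unit = "GEOID",
    weight = "estimate"
  )) %>%
  ungroup()

county_d
```

On this toy data only county 01001 survives the filter, and working the index by hand gives D = 0.5 * (|0.8 - 0.2| + |0.2 - 0.8|) = 0.6. Note that dissimilarity() only uses the group, unit, and weight columns, so building COUNTY_FIPS with substr() is not what caused the identical estimates.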

kwalkertcu