-1

I have created a function that uses st_join() from the sf package to extract the congressional district (a polygon) from a set of latitude and longitude coordinates, using a different shapefile to identify the congressional district depending on a "congress" argument that is specified. (This is necessary because districts are periodically redrawn, so the boundaries change over time.) The next step is to apply the function row by row to a data frame containing multiple rows of coordinates (and associated "congress" values) so that the congress value for a given row determines which shapefile to use, and then assign the extracted district to a new variable.

I'm running into trouble applying this function row-by-row. I first tried using the rowwise() and mutate() functions from dplyr, but got a "must be size 1" error. Based on the comments to this question, I put list() around the variable assigned inside the mutate() function, but this has resulted in the new variable being a list instead a single character string.

I would greatly appreciate help figuring out a way to either (i) modify the function so that it can be applied row by row using rowwise() and mutate() or (ii) apply my function row-by-row in some other way.

Reproducible code is below; you just need to download two shapefiles from https://cdmaps.polisci.ucla.edu/ ("districts104.zip" and "districts111.zip"), unzip them, and put them in your working directory.

library(tidyverse)
library(sf)

districts_104 <- st_read("districts104.shp")
districts_111 <- st_read("districts111.shp")

congress <- c(104, 111)
latitude <- c(37.32935, 37.32935)
longitude <- c(-122.00954, -122.00954)
df_test <- data.frame(congress, latitude, longitude)

point_geo_test <- st_as_sf(df_test,
                             coords = c(x = "longitude", y = "latitude"),
                             crs = st_crs(districts_104)) # prep for st_join()

sf_use_s2(FALSE) # preempt evaluation error that would otherwise pop up when using the st_join function

extract_district <- function(points, cong) {
  shapefile <- get(paste0("districts_", cong))
  st_join_results <- st_join(points, shapefile, join = st_within)
  paste(st_join_results$STATENAME, st_join_results$DISTRICT, sep = "-")
}

point_geo_test <- point_geo_test %>%
  rowwise %>%
  mutate(district = list(extract_district(points = point_geo_test, cong = congress)))
RSS
  • 163
  • 1
  • 2
  • 11
  • Try `mutate(district = list(extract_district(points = point_geo_test_2, cong = congress)))` that way each row returns a list of length 1. – MrFlick Jul 01 '22 at 20:23
  • Thanks, that does fix the "must be size 1" error, but now the new variable is a list (of length 2 with the test data when used on `point_geo_test_2`, apparently corresponding to the number of rows in the data frame) instead of a single value. The variable should contain just be the district (a character string) that the `extract_district()` function returns. – RSS Jul 01 '22 at 20:33
  • Apparently it returns some of length 2. – harre Jul 01 '22 at 21:02
  • It does return a list of length 2 if applied to the entire data frame (e.g., call `extract_district(points = point_geo_test_2, cong = 108)`), but that is because the data frame contains multiple rows. What I'm trying to do is apply the function row by row, such that the function considers only a single row's data at a time. Perhaps a workaround would be to select the list element according to the row that is being iterated on? I know how to select an element from a list (you just add `[[x]]`, where `x` is the element number), but how does one embed the row number? – RSS Jul 01 '22 at 21:15
  • Could the essential problem be reproduced using less code and less shapefiles to download? I'd love to help if it wasn't so much work. – Caspar V. Jul 02 '22 at 23:17
  • Hi @CasparV. Thanks for your willingness to help! I've tried to simplify the code with just the essential problem, and do so such that only two shapefiles are needed (districts104 and districts111). Is that straightforward enough? – RSS Jul 03 '22 at 04:10

1 Answers1

2

Edit 7 July:

From your comments I understand you were looking for something different, the assumption I made about why your function was giving multiple values was wrong. Hence this new answer from scratch:


The custom function you've written doesn't lend itself to row-by-row application, because it already processes all rows at once:

Given the following input:

congress <- c(104, 111, 104, 111, 104, 111)
latitude <- c(37.32935, 37.32935, 41.1134016, 41.1134016, 42.1554948, 42.1554948)
longitude <- c(-122.00954, -122.00954, 73.720356, 73.720356, -87.868850502543, -87.868850502543)

point_geo_test contains these values:

> point_geo_test
[...]
  congress                   geometry
1      104 POINT (-122.0095 37.32935)
2      111 POINT (-122.0095 37.32935)
3      104   POINT (73.72036 41.1134)
4      111   POINT (73.72036 41.1134)
5      104 POINT (-87.86885 42.15549)
6      111 POINT (-87.86885 42.15549)

and extract_district() returns this:

> extract_district(point_geo_test, 104)
[...]
[1] "California-14" "California-14" "NA-NA"         "NA-NA"         "Illinois-10"   "Illinois-10"

This is already a result for each row. The only problem is, while they are the correct results for the coordinates of each row, they the name for those coordinates only during congress 104. Hence, these values are only valid for the rows in point_geo_test where congress == 104.

Extracting correct values for all rows

We will create a function that returns the correct data for all rows, eg the correct name for the coordinates during the associated congress.

I've simplified your code slightly: the df_test is not an intermediate data frame any more, but defined directly in the creation of point_geo_test. Any values I extract, I'll save into this data frame as well.

library(tidyverse)
library(sf)
sf_use_s2(FALSE)

districts_104 <- st_read("districts104.shp")
districts_111 <- st_read("districts111.shp")

congress <- c(104, 111, 104, 111, 104, 111)
latitude <- c(37.32935, 37.32935, 41.1134016, 41.1134016, 42.1554948, 42.1554948)
longitude <- c(-122.00954, -122.00954, 73.720356, 73.720356, -87.868850502543, -87.868850502543)

point_geo_test <- st_as_sf(data.frame(congress, latitude, longitude),
                           coords = c(x = "longitude", y = "latitude"),
                           crs = st_crs(districts_104))

To keep the code more flexible and organized, I'll create a generic function that can fetch any parameter for the given coordinates:

extract_values <- function(points, parameter) {
  # initialize return values, one for each row in `points`
  values <- rep(NA, nrow(points))
  
  # for each congress present in `points`, lookup parameter and store in the rows with matching congress
  for(cong in unique(points$congress)) {
    shapefile <- get(paste0("districts_", cong))
    st_join_results <- st_join(points, shapefile, join = st_within)
    values[points$congress == cong] <- st_join_results[[parameter]][points$congress == cong]
  }
  
  return(values)
}

Examples:

> extract_values(point_geo_test, 'STATENAME')
[1] "California" "California" NA           NA           "Illinois"   "Illinois"  
> extract_values(point_geo_test, 'DISTRICT')
[1] "14" "15" NA   NA   "10" "10"

Storing values

point_geo_test$state <- extract_values(point_geo_test, 'STATENAME')
point_geo_test$district <- extract_values(point_geo_test, 'DISTRICT')
point_geo_test$name <- paste(point_geo_test$state, point_geo_test$district, sep = "-")

Result:

> point_geo_test
Simple feature collection with 6 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -122.0095 ymin: 37.32935 xmax: 73.72036 ymax: 42.15549
Geodetic CRS:  GRS 1980(IUGG, 1980)
  congress      state district          name                   geometry
1      104 California       14 California-14 POINT (-122.0095 37.32935)
2      111 California       15 California-15 POINT (-122.0095 37.32935)
3      104       <NA>     <NA>         NA-NA   POINT (73.72036 41.1134)
4      111       <NA>     <NA>         NA-NA   POINT (73.72036 41.1134)
5      104   Illinois       10   Illinois-10 POINT (-87.86885 42.15549)
6      111   Illinois       10   Illinois-10 POINT (-87.86885 42.15549)
Caspar V.
  • 1,782
  • 1
  • 3
  • 16
  • Thank you, Caspar! This is super helpful. Unfortunately, I tested it on my actual data, and it doesn't seem to be returning the right districts (e.g., lat 41.1134016, -long 73.720356, which is in NY, returns CA-14). I will try to troubleshoot, but to answer your first question, the function is supposed to return a character string for the congressional district in which the latitude/longitude coordinates lie. Because districts are identified by state AND number (e.g., "California's 1st" is different from "Illinois's 1st"), both state and district number are necessary to identify a district. – RSS Jul 07 '22 at 18:19
  • @RSS Testing here with those coords for districts_104 and districts_111 returns `"New York-19" "New York-18"`. Please try again, if you need any specific examples let me know. – Caspar V. Jul 07 '22 at 18:53
  • @RSS Good to know that extracting a character string was intentional. Was I also correct in assuming that you only want the first result from your original `extract_district()`? – Caspar V. Jul 07 '22 at 18:59
  • Hmm, I'm getting `"California-14"` and `"California-15"` for these coordinates if they're added to the end of the example dataset. I'll post sample code in another comment shortly. I think maybe the problem is that the coordinates from the first row are being used to extract the district for subsequent rows? To answer your question, I'm trying to pull the result from `extract_district()` as applied to a particular row, where the coordinates for that row determine the location and and the `congress` value for the row determines which shapefile to use. – RSS Jul 07 '22 at 19:24
  • Sample code: `congress <- c(104, 111, 104, 111, 104, 111)` `latitude <- c(37.32935, 37.32935, 41.1134016, 41.1134016, 42.1554948, 42.1554948)` `longitude <- c(-122.00954, -122.00954, 73.720356, 73.720356, -87.868850502543, -87.868850502543)` `df_test <- data.frame(congress, latitude, longitude)` Calling `extract_district(point_geo_test, point_geo_test$congress)` yields `"California-14" "California-15" "California-14" "California-15" "California-14" "California-15"`, which appears to just be the district from the first row's coordinates repeated. – RSS Jul 07 '22 at 19:26
  • It works! Brilliant! Thank you for all your great help, Caspar, and apologies for the lack of clarity on my part. – RSS Jul 07 '22 at 23:42