2

I'm trying to produce a choropleth map of county level data on COVID-19 infections using R. I'm a relative newbie to R so....

I've done some fairly basic stuff with ggmap to plot spatial data, but never anything quite like this. Typically I just have points of interest that I need to overlay on a map, so I can use geom_point and their lat/lon. In this case I need to construct the underlying map and then fill regions and I'm struggling with doing that in the ggplot world.

I've followed some online examples I've found to get as far as this:

library(ggplot2)
library(broom)
library(geojsonio)

#get a county level map geoJSON file
counties <- geojson_read("https://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_500k.json", what = "sp")

#filter our alaska and Hawaii
lower48 <- counties[(counties@data$STATE != "02" & counties@data$STATE != "15") ,]

#turn it into a dataframe for ggmap
new_counties <- tidy(lower48)

# Plot it
print(ggplot() +
  geom_polygon(data = new_counties, aes( x = long, y = lat, group = group), fill="#69b3a2", color="white") +
  theme_void() +
  coord_map())

Which produces this plot:

enter image description here

So far so good. But my new_counties dataframe now looks like this:

head(new_counties)
# A tibble: 6 x 7
   long   lat order hole  piece group id  
  <dbl> <dbl> <int> <lgl> <chr> <chr> <chr>
1 -85.4  33.9     1 FALSE 1     0.1   0    
2 -85.4  33.9     2 FALSE 1     0.1   0    
3 -85.4  33.9     3 FALSE 1     0.1   0    
4 -85.4  33.9     4 FALSE 1     0.1   0    
5 -85.4  33.9     5 FALSE 1     0.1   0    
6 -85.4  33.8     6 FALSE 1     0.1   0 

So I've lost anything that I might be able to tie back to my county level data on infections.

My data has a 5-digit FIPS code for each county. First two digits are the state and last three are the county. My geoJSON file has a much more detailed FIPS code. I tried grabbing just the first 5 and creating my own data element I could map back to

library(ggplot2)
library(broom)
library(geojsonio)

#get a county level map geoJSON file
counties <- geojson_read("https://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_050_00_500k.json", what = "sp")

#filter our alaska and Hawaii
lower48 <- counties[(counties@data$STATE != "02" & counties@data$STATE != "15") ,]

#add my own FIPS code
lower48@data$myFIPS <- substr(as.character(lower48@data$GEO_ID),1,5)  

#turn it into a dataframe for ggmap
new_counties <- tidy(lower48, region = "myFIPS")


# Plot it
print(ggplot() +
  geom_polygon(data = new_counties, aes( x = long, y = lat, group = group), fill="#69b3a2", color="white") +
  theme_void() +
  coord_map())

But that produces this plot

enter image description here

And I have to say I'm not quite familiar enough with broom::tidy to know exactly why. I also notice as I type this that I need to filter out Puerto Rico!

If anybody can point me back in a useful direction....I'm not wedded to the current approach, though I would like to stick to ggplot2 or ggmap. My boss eventually wants me to overlay certain features. Ultimately the goal is to follow the example here and produce an animated map showing data over time, but I'm obviously a long way from that.

jerH
  • 1,085
  • 1
  • 12
  • 30
  • 1
    use `sf::st_read()` to read the geojson, so you get it as an `sf` object. Then use `ggplot2::geom_sf()` to plot it. - `sf` is the successor to `sp`, so I would recommend anyone to move away from `sp`. – SymbolixAU Mar 24 '20 at 23:48
  • 1
    where an `sf` object **is** already a data.frame. So all standard filtering and subsetting operations 'just work'. – SymbolixAU Mar 24 '20 at 23:49

1 Answers1

4

There's many ways to do this, but the concept is basically the same: Find a map containing country level FIPS codes and use them to link with a data source, also containing the same FIPS codes as well as the variable for plotting (here the number of covid-19 cases per day).

#devtools::install_github("UrbanInstitute/urbnmapr")
library(urbnmapr) # For map
library(ggplot2)  # For map
library(dplyr)    # For summarizing
library(tidyr)    # For reshaping
library(stringr)  # For padding leading zeros

# Get COVID cases, available from:
url <- "https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv
             ?_ga=2.162130428.136323622.1585096338-408005114.1585096338"

COV <- read.csv(url, stringsAsFactors = FALSE)
names(COV)[1] <- "countyFIPS"  # Fix the name of first column. Why!?

The data are stored in wide format with daily cases per county spread across columns. This needs to be gathered before merging with the map. The dates need to be converted to proper dates. The FIPS codes are stored as integers, so these need to be converted to a character with leading 0s in order to merge with the map data. I use the urbnmap package for the map data.

Covid <- pivot_longer(COV, cols=starts_with("X"), 
                     values_to="cases",
                     names_to=c("X","date_infected"),
                     names_sep="X") %>%                
  mutate(date_infected = as.Date(date_infected, format="%m.%d.%Y"),
         countyFIPS = str_pad(as.character(countyFIPS), 5, pad="0"))

# Obtain map data for counties (to link with covid data) and states (for showing borders)
states_sf <- get_urbn_map(map = "states", sf = TRUE)
counties_sf <- get_urbn_map(map = "counties", sf = TRUE)

# Merge county map with total cases of cov
counties_cov <- inner_join(counties_sf, group_by(Covid, countyFIPS) %>%
       summarise(cases=sum(cases)), by=c("county_fips"="countyFIPS"))

counties_cov %>%
  ggplot() +
  geom_sf(mapping = aes(fill = cases), color = NA) +
  geom_sf(data = states_sf, fill = NA, color = "black", size = 0.25) +
  coord_sf(datum = NA) +   
  scale_fill_gradient(name = "Cases", trans = "log", low='pink', high='navyblue', 
                      na.value="white", breaks=c(1, max(counties_cov$cases))) +
  theme_bw() + theme(legend.position="bottom", panel.border = element_blank())

enter image description here


For animation, you can use the gganimate package and transition through the days. The commands are similar to above except that the covid data should not be summarized.

library(gganimate)

counties_cov <- inner_join(counties_sf, Covid, by=c("county_fips"="countyFIPS"))

p <- ggplot(counties_cov) + ... # as above

p <- p + transition_time(date_infected) +
  labs(title = 'Date: {frame_time}')

animate(p, end_pause=30)

enter image description here

Edward
  • 10,360
  • 2
  • 11
  • 26
  • That is awesome...I'd made some progress using the `choroplethr` package, but this puts me to shame. I'll study this... – jerH Mar 25 '20 at 14:00
  • For some reason it didn't like the countyFIPS column name it read from the URL and stuck the i umlaut (or whatever you call it) on there, so I manually removed that. Do you know what transformation is required to overlay some lat/lon geom_points on the map? As is, they all just wind up in South Dakota with a warning `Transformation introduced infinite values in discrete y-axis`. I ran into that with another package I tried and it had to do with the projection the map was using. That package provided a transform() function to make lat/lon's plottable... – jerH Mar 25 '20 at 14:44
  • Disregard...figured it out using the `usmap_transform` function from the `usmap` package – jerH Mar 25 '20 at 15:37
  • New issue though, when I attempt to animate I get `Error: time data must either be integer, numeric, POSIXct, Date, difftime, orhms In addition: Warning message: In min(cl[cl != 0]) : no non-missing arguments to min; returning Inf` . date_infected entries look like `1.22.2020` do I need to use `as.date()` or `strptime()` or similar to transform these? – jerH Mar 25 '20 at 15:49
  • That did it.... `as.date()` and it's all good. One last wrinkle to figure out..thanks again – jerH Mar 25 '20 at 16:47
  • Upon further review, you had already as.date'd...just to `infected` so I just need to change `transition_time(date_infected)` to `transition_time(infected)`. Do you know why some of the counties start out as gray and others white? I believe in both cases the value is 0. I'd like to get rid of the starting grays if possible... – jerH Mar 25 '20 at 21:25
  • 1
    I fixed that small bug about date_infected. Concerning the grey counties, those have cases on at least one day during the whole period, while white counties have no cases at any time. I thought it was a nice "feature", but there should be a way to disable that. ... – Edward Mar 25 '20 at 23:19
  • Oh okay....that's cool then. I thought ít was just inconsistent with 0s for some reason....as long as I can explain that. Thanks for all your help! – jerH Mar 26 '20 at 00:01
  • 1
    If you don't like the grey, you can add na.values="white" or na.values="transparent" in the scale_fill_gradient function. – Edward Mar 26 '20 at 00:42
  • I updated packages and there are no longer .png files in the working directory, so gifski has nothing to encode. I had been using `print(p + transition_time(infected) + labs(title='COVID-19 Cases: {frame_time}'))` in my script and it was fine...any ideas? Thanks! – jerH Mar 27 '20 at 19:35
  • I actually went ahead and posted a new question about itt [here](https://stackoverflow.com/questions/60893598/why-are-ggplot-output-files-no-longer-appearing) – jerH Mar 27 '20 at 20:34