Discretization of zip codes to US regions in R

Question

I would like to discretize data with zip codes into regions

I have character data

sample:

zip_code
'45654'
'12321'
'99453'

etc

I have 6 categories with rules:

region 1 - NE: 01000-19999

region 2 - SE: 20000-39999

region 3 - MW: 40000-58999,60000-69999

region 4 - SW: 70000-79999,85000-88499

region 5 - MT: 59000-59999,80000-84999,88900-89999

region 6 - PC: 90000-99999

I would like my output to be factor data:

region
'MW'
'NE'
'PC'

etc

Obviously, I know many ways to discretize the data, but none are clean and elegant (like loops, ifelse, etc)

Is there an elegant way to apply a case with 6 categories to discretize this data?

score 2 · Accepted Answer · answered Nov 12 '14 at 01:41

2

Okay, messy but this can work. I assume you'll have to use character objects since some zip codes start with 0. Obs. replace these numbers with your zip codes.

zip_code <- c('1','6','15')
regions <- list(NE = as.character(1:3), 
        SE = as.character(4:6), 
        MW = as.character(7:9), 
        SW = as.character(10:12), 
        MT = as.character(13:15), 
        PC = as.character(16:19))
sapply(zip_code, function(x) names(regions[sapply(regions, function(y) x %in% y)]))

 1    6   15 
"NE" "SE" "MT"

answered Nov 12 '14 at 01:41

sckott

5,755
2
26
42

note that this is likely not the most elegant solution, I'm sure someone else will chime in with other soln's – sckott Nov 12 '14 at 02:07
There may be more elegant solutions but yours is functional, which is a mind frame that I'm having a hard time adapting. – Michal Nov 12 '14 at 16:59

jlhoward · Answer 2 · 2014-11-12T17:03:00.747

Here is a data.table solution using foverlaps(...) and the full US zip code database in package zipcode for the example. Note that your definitions of the ranges are deficient: for instance there are zip codes in NH that are outside the NE range, and PR is completely missing.

library(data.table)  # 1.9.4+
library(zipcode)
data(zipcode)        # database of US zip codes (a data frame)

zips    <- data.table(zip_code=zipcode$zip)
regions <- data.table(region=c("NE" , "SE", "MW", "MW", "SW", "SW", "MT", "MT", "MT", "PC"),
                      start =c(01000,20000,40000,60000,70000,85000,59000,80000,88900,90000),
                      end   =c(19999,39999,58999,69999,79999,88400,59999,84999,89999,99999))
setkey(regions,start,end)
zips[,c("start","end"):=list(as.integer(zip_code),as.integer(zip_code))]
result <- foverlaps(zips,regions)[,list(zip_code,region)]
result[sample(1:nrow(result),10)]   # random sample of the result
#    zip_code region
#  1:    27113     SE
#  2:    36101     SE
#  3:    55554     MW
#  4:    91801     PC
#  5:    20599     SE
#  6:    90250     PC
#  7:    95329     PC
#  8:    63435     MW
#  9:    60803     MW
# 10:    07040     NE

foverlaps(...) works this way: suppose a data.table x has columns a and b that represent a range (e.g., a <= b for all rows), and a data.table y has columns c and d that similarly represent a range. Then foverlaps(x,y) finds, for each row in x, all the rows in y which have overlapping ranges.

In your case we set up the y argument as the regions, where the ranges are the beginning and ending zipcodes for each (sub) region. Then we set up x as the original zip code database using the actual zip codes (converted to integer) for both the beginning and end of the range.

foverlaps(...) is extremely fast. In this case the full US zip code database (>44,000 zipcodes) was processed in about 23 milliseconds.

akrun · Answer 3 · 2014-11-12T04:39:17.610

You could also try (Using @Scott Chamberlain's data)

  with(stack(regions), unique(ind[ave(values %in% zip_code, ind, FUN=I)]))
 #[1] NE SE MT
 #Levels: MT MW NE PC SE SW

Or

 library(dplyr)
 library(tidyr)
 unnest(regions, region) %>%
                     group_by(region) %>%
                     filter(x %in% zip_code)

 # region x
 #1   NE  1
 #2   SE  6
 #3   MT 15

Or

 r1 <- vapply(regions, function(x) any(x %in% zip_code), logical(1))
 names(r1)[r1]
 #[1] "NE" "SE" "MT"

Discretization of zip codes to US regions in R

3 Answers3