Here is a data.table solution using foverlaps(...)
and the full US zip code database in package zipcode
for the example. Note that your definitions of the ranges are deficient: for instance there are zip codes in NH that are outside the NE
range, and PR is completely missing.
library(data.table) # 1.9.4+
library(zipcode)
data(zipcode) # database of US zip codes (a data frame)
zips <- data.table(zip_code=zipcode$zip)
regions <- data.table(region=c("NE" , "SE", "MW", "MW", "SW", "SW", "MT", "MT", "MT", "PC"),
start =c(01000,20000,40000,60000,70000,85000,59000,80000,88900,90000),
end =c(19999,39999,58999,69999,79999,88400,59999,84999,89999,99999))
setkey(regions,start,end)
zips[,c("start","end"):=list(as.integer(zip_code),as.integer(zip_code))]
result <- foverlaps(zips,regions)[,list(zip_code,region)]
result[sample(1:nrow(result),10)] # random sample of the result
# zip_code region
# 1: 27113 SE
# 2: 36101 SE
# 3: 55554 MW
# 4: 91801 PC
# 5: 20599 SE
# 6: 90250 PC
# 7: 95329 PC
# 8: 63435 MW
# 9: 60803 MW
# 10: 07040 NE
foverlaps(...)
works this way: suppose a data.table x has columns a
and b
that represent a range (e.g., a
<= b
for all rows), and a data.table y
has columns c
and d
that similarly represent a range. Then foverlaps(x,y)
finds, for each row in x
, all the rows in y
which have overlapping ranges.
In your case we set up the y
argument as the regions, where the ranges are the beginning and ending zipcodes for each (sub) region. Then we set up x
as the original zip code database using the actual zip codes (converted to integer) for both the beginning and end of the range.
foverlaps(...)
is extremely fast. In this case the full US zip code database (>44,000 zipcodes) was processed in about 23 milliseconds.