0

This is my first Question and I couldn't find a answer on the web..

I want to make a statistic of the absolute number of cases in the 5 boroughs of NYC.

PROBLEM:

I have a column in a data frame in R, which is untidy. It should contain only the boroughs of NYC for each case. But it is mixed of 250+ cases which are one of the 3 below:

  1. Zip Codes which are either Bronx, Brooklyn, Manhattan, Queens or Staten Island.
  2. abbreviations like 'bx-12345' for Bronx, 'qu-23452' for Queens, 'ma-23412' for manhattan(...) and
  3. half of the cases is really 'Bronx', 'Brooklyn', 'Manhattan' (...). No need to change it.

QUESTION: How to match it with the Borough?

I have a list of the Zip Codes of each Borough.

  1. How can I exchange the Zip codes with the matching borough?

  2. How can I take the first 2 letters of a string like 'BX' in 'BX-12345' to match it for all Boroughs?

Examples from the df:

[1] "BK-14" "QN-12" "BX-12" "SI-01" "BX-03" "11216" "SI-01" "BX-05" "11249" "42"
[11] "10009" "QN-05" "10453" "10453" "10453" "10453" "10453" "10453" "10453" "10457"
[21] "10458" "BX-04" "10453" "10453" "10453" "10453" "10467" "QN-07" "10029" "10028"
[31] "11221" "10452" "11206" "Manhattan" "Manhattan" "36" "11212" "Manhattan" "10032" "10459"
[41] "10459" "10459" "10459" "BK-12" "BK-15" "BX-12" "10467" "10029" "SI-01" "36"
[51] "11232" "11206" "10002" "10455" "10455" "11239" "11239" "11239" "11239" "11239"
[61] "11239" "11239" "11239" "SI-03" "11239" "11239" "11239" "11239" "11239" "11239"
[71] "11239" "11239" "11239" "11239" "11239" "11239" "10459" "10455" "10468" "11216"
[81] "11249" "10460" "Brooklyn" "11206" "11238" "10459" "11206" "11206" "10468" "10468"
[91] "10002" "10456" "SI-01" "QN-08" "BX-09" "10022" "11691" "Manhattan" "Manhattan" "BX-07"

1 Answers1

0

Consider a data.frame with untidy boroughs:

df <- data.frame(borough = c("12345", "BX-12345", "Bronx"), stringsAsFactors = FALSE)
df
#>    borough
#> 1    12345
#> 2 BX-12345
#> 3    Bronx

And a mapping table:

map <- data.frame(zip = "12345", abbr = "BX", name = "Bronx", stringsAsFactors = FALSE)
map
#>     zip abbr  name
#> 1 12345   BX Bronx

First reduce "abbr-zip" to "abbr":

df$borough <- gsub("^([^-]*)-.*$", "\\1", df$borough)

LEFT OUTER JOIN with mapping table on zip, taking in all matching values:

df1 <- merge(df, map, by.x = "borough", by.y = "zip", all.x = TRUE)
df$borough <- ifelse(is.na(df1$name), df1$borough, df1$name)

LEFT OUTER JOIN with mapping table on abbr, taking in all matching values:

df2 <- merge(df, map, by.x = "borough", by.y = "abbr", all.x = TRUE)
df$borough <- ifelse(is.na(df2$name), df2$borough, df2$name)

Result:

df
#>   borough
#> 1   Bronx
#> 2   Bronx
#> 3   Bronx

Actually it is simpler to map "abbr-zip" to "zip" with

df$borough <- gsub("^[^-]*-(.*)*$", "\\1", df$borough)

After that the first LEFT OUTER JOIN is sufficient. And you do not need abbr in the mapping table.

Ralf Stubner
  • 26,263
  • 3
  • 40
  • 75