Convert long state names embedded with other text to two-letter state abbreviations

Question

My objective is to identify US states written out in a character vector that has other text and convert the states to abbreviated form. For example, "North Carolina" to "NC". It is simple if the vector only has long-form state names. However, my vector has other text in random places, as in the example "states".

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

From another post I found this:

state.abb[match(states, state.name)]

but it converts only the standalone Texas

> state.abb[match(states, state.name)]
[1] NA   NA   NA   NA   "TX" NA  

and not the New Jersey, Alabama and Iowa strings.

From "Fast grep with a vectored pattern or match, to return list of all matches" I tried:

sapply(states, grep(pattern = state.name, x = states, value = TRUE))

but

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'Alabama 02138' of mode 'function' was not found
In addition: Warning message:
In grep(pattern = state.name, x = states, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used

Nor does this work:

sapply(states, function(x) state.abb[grep(state.name, states)])

This question did not help: "regular expression to convert state names to abbreviations"

How do I convert the embedded long names to the state abbreviation?

EDIT: I want to return the vector with the only change being that the long names of the states have been abbreviated, e.g., "Plano New Jersey" becomes "Plano NJ".

Thank you for correcting and/or educating me.

You might get 'NY, NY' out of this, plus there are towns called 'California' as well as states. However, that's nitpicking for you. — Jonathan Leffler, Aug 30 '14 at 14:02
@Jonathan Leffler: yes, towns called states is an occupational hazard. Plus I have cities that are in more than one state. Sigh. Why can't data behave nicely? — lawyeR, Aug 30 '14 at 14:07
There's this thing called 'The Real World'™ that you should visit sometime ("Dear Kettle — you're black! Signed, Pot"). It has a nasty habit of diverging from the tidy schemes that those of us who write programs devise. — Jonathan Leffler, Aug 30 '14 at 14:16
Might be good to search the web for a table of "City, STATE" names and read it into R to do the matching. — Rich Scriven, Aug 30 '14 at 15:30
@RichardScriven: Good point. I found 394 of such city, ST matches on Wikipedia, created a data frame and can use that also. I am still struggling to allocate Columbus, for example, since there are cities of that name in several states. — lawyeR, Aug 30 '14 at 15:45
You could create a by-state list of each city-state combination if you can find all 50 states somewhere. — Rich Scriven, Aug 30 '14 at 15:53

akrun · Answer 1 · 2014-08-30T14:10:33.413

3

Try:

indx <- paste0(".*(", paste(state.name, collapse="|"), ").*")
v1 <- gsub(indx, "\\1", states)
ifelse( v1 %in% state.abb, v1, state.abb[match(v1, state.name)])
#[1] "NJ" "NC" NA   "AL" "TX" "IA"

If you want to just replace the states with the abbreviation and not the other text, you could also do:

indx1 <- paste(state.name, collapse="|")   
indx2 <- state.abb[match(v1, state.name)]

mapply(gsub, indx1, indx2, states, USE.NAMES=FALSE)
#[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
#[5] "TX"            "Town IA 99999"

edited Aug 30 '14 at 14:10

answered Aug 30 '14 at 13:24

akrun


I noticed that this replaces each matching state with the abbreviation of the state that comes first in the alphabet. For example, the input `"Texas Alabama"` will result in `"AL AL"`. Is there a way to avoid that? – ebo Aug 30 '14 at 16:40
@EricBouwers In the example the OP provided, it was not the case. So, I didn't check it that way – akrun Aug 30 '14 at 17:46
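
One possible way to avoid the "AL AL" result raised in the comment above is to replace each match individually instead of deriving one replacement per string. The following is a base-R sketch, not the answerer's code: `lookup`, `pat`, and `abbreviate_states` are names introduced here for illustration, and the word-boundary anchors and longest-name-first ordering are assumptions about the desired behaviour.

```r
# Named lookup: full state name -> two-letter abbreviation
lookup <- setNames(state.abb, state.name)

# Try longer names first and require word boundaries, so that
# "West Virginia" is not matched as "Virginia" (assumed requirement)
by_length <- state.name[order(nchar(state.name), decreasing = TRUE)]
pat <- paste0("\\b(", paste(by_length, collapse = "|"), ")\\b")

abbreviate_states <- function(x) {
  m <- gregexpr(pat, x, perl = TRUE)
  # Replace every matched state name with its own abbreviation
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(hit) unname(lookup[hit]))
  x
}

abbreviate_states(c("Texas Alabama", "Plano New Jersey", "xyz"))
# [1] "TX AL"    "Plano NJ" "xyz"
```

Strings without a state name pass through unchanged, because `regmatches<-` only touches the matched spans.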

Tyler Rinker · Accepted Answer · 2014-08-30T14:38:59.760

3

Here's another approach:

library(qdap)
mgsub(state.name, state.abb, states)

## [1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
## [5] "TX"            "Town IA 99999"

If you are uncertain that the states will be capitalized you may want to use:

mgsub(state.name, state.abb, states, ignore.case=TRUE, fixed=FALSE)

edited Aug 30 '14 at 14:38

answered Aug 30 '14 at 13:53

Tyler Rinker


score 1 · Answer 3 · answered Aug 30 '14 at 13:24

The question did not make the expected result clear, so we assume here that you want to preserve the input text and replace only the full state names with their abbreviations.

Create a list, st, whose names are the full state names and whose values are the abbreviations. Then use paste(..., collapse = "|") to create a regular expression that matches any state and use gsubfn from the gsubfn package to perform the substitutions.

library(gsubfn)
st <- as.list(setNames(state.abb, state.name))
gsubfn(paste(state.name, collapse = "|"), st, states)

giving:

[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
[5] "TX"            "Town IA 99999"

score 1 · Answer 4 · edited May 23 '17 at 12:10

If you do not want to use additional packages you can use the mapply function to apply gsub for all pairs of state.name and state.abb, e.g.:

mapply(gsub,state.name,state.abb,"ALABAMA 123",ignore.case=TRUE,USE.NAMES=FALSE)

The result is a character vector with one element per state; an element differs from the input only where that state's name was found, e.g.:

 [1] "AL 123"      "ALABAMA 123" "ALABAMA 123" "ALABAMA 123" "ALABAMA 123" 
 [6] ...

Taking the shortest string from this vector gives the desired result, so we order the vector by string length and take the first element.
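
As a quick check of why the shortest result is the one to keep, here is a small sketch with an input containing overlapping names ("Virginia" inside "West Virginia"). The input string is made up for illustration.

```r
# Hypothetical input: "Virginia" is a substring of "West Virginia"
x <- "West Virginia 123"
v <- mapply(gsub, state.name, state.abb, x,
            ignore.case = TRUE, USE.NAMES = FALSE)

# Only three distinct results: untouched, "Virginia" replaced,
# and "West Virginia" replaced
unique(v)
# [1] "West Virginia 123" "West VA 123"       "WV 123"

# The shortest candidate corresponds to the longest matched name
v[order(nchar(v))][1]
# [1] "WV 123"
```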

The complete code:

replaceState <- function(x) {  
     v = mapply(gsub,state.name,state.abb,x,ignore.case=TRUE, USE.NAMES=FALSE)
     v[order(nchar(v))][1] 
}

sapply(states, replaceState, USE.NAMES=FALSE)

Unfortunately, this approach replaces only one state name per string (the one whose replacement shortens the string most, typically the longest name). To replace multiple different states we need to iterate, e.g.:

replaceState <- function(x) {  
     v = mapply(gsub,state.name,state.abb,x,ignore.case=TRUE, USE.NAMES=FALSE)
     v[order(nchar(v))][1] 
}

replaceStates <- function(x) {
     newX = replaceState(x)

     # if they are different a state has been replaced, 
     # we try again to replace all states.
     if(newX != x){ 
          replaceStates(newX)
     } else {
          newX
     }
}

# Note the 'replaceStates'
sapply(states, replaceStates, USE.NAMES=FALSE)

rnso · Answer 5 · 2014-08-31T01:04:14.620

Try:

for(r in 1:nrow(states.list)) {
    states = gsub(states.list[r,1], states.list[r,2], states)
}

states
[1] "Plano NJ"      "NC"            "xyz"           "AL 02138"      "TX"            "Town IA 99999"

Data:

states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138", "Texas", "Town Iowa 99999")

states.list = structure(list(state.name = structure(c(4L, 1L, 5L, 2L, 3L), .Label = c("Alabama", 
"Iowa", "Minnesota", "New Jersey", "Texas"), class = "factor"), 
    state.abb = structure(c(4L, 1L, 5L, 2L, 3L), .Label = c("AL", 
    "IA", "MN", "NJ", "TX"), class = "factor")), .Names = c("state.name", 
"state.abb"), class = "data.frame", row.names = c(NA, -5L))

states.list
  state.name state.abb
1 New Jersey        NJ
2    Alabama        AL
3      Texas        TX
4       Iowa        IA
5  Minnesota        MN
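
The five-row table above could probably be replaced by one built from R's built-in `state.name` and `state.abb` vectors. The sketch below assumes that is acceptable; `full.list` and `out` are names introduced here, and the extra "West Virginia 25301" element is a made-up input added to show why the rows are sorted longest name first.

```r
states <- c("Plano New Jersey", "NC", "xyz", "Alabama 02138",
            "Texas", "Town Iowa 99999", "West Virginia 25301")

# All 50 states, as character columns rather than factors
full.list <- data.frame(state.name, state.abb, stringsAsFactors = FALSE)

# Longest names first, so "West Virginia" is handled before "Virginia"
full.list <- full.list[order(nchar(full.list$state.name), decreasing = TRUE), ]

out <- states
for (r in seq_len(nrow(full.list))) {
  # fixed = TRUE treats the state name as a literal string, not a regex
  out <- gsub(full.list$state.name[r], full.list$state.abb[r], out, fixed = TRUE)
}

out
# [1] "Plano NJ"      "NC"            "xyz"           "AL 02138"     
# [5] "TX"            "Town IA 99999" "WV 25301"
```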
