I'm using some old code I found to take character strings, for example:

[1] "Indianapolis, IN"         "Columbia, TN"             "Chicago, IL"              "Next door to Florida Man"
[5] "Holeintheroad, TN"        "RUCH 11 LISTOPADA"    

and find which have state abbreviations in them. I have the following:

user_info$location[user_info$location!=""&!is.na(user_info$location)] %>%
  str_match(sprintf("(%s)",paste(state.abb,collapse="|"))) %>%
  .[,2] %>%
  table() %>%
  broom::tidy() %>%
  set_names(c("NAME","n")) %>%
  as.data.frame() -> tweet_states_abbr

and where datasets::state.abb is:

 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS"
[25] "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV"
[49] "WI" "WY"

The problem is that str_match() is picking up too much info -- in addition to actual state abbreviations like "IN" or "CA", it's picking up parts of words (e.g., in "MAGA" it picks up "MA" or "GA"). I know regexes can solve this, but I'm not sure how to incorporate them here, since the pattern is built with sprintf() and %s rather than written out by hand -- so I'm not sure where to put \b or \s. Any advice? Thanks!

1 Answer
You just need to make sure that the state abbreviations are surrounded by word-boundary markers, \\b, so that a match like "MA" can't start or end in the middle of a word.

TestData <- c("Indianapolis, IN", "Columbia, TN", "Chicago, IL",
              "Next door to Florida Man", "Holeintheroad, TN",
              "RUCH 11 LISTOPADA", "MAGA")

## Wrap the alternation of abbreviations in \b ... \b
StatePat <- paste0("\\b(", paste(datasets::state.abb, collapse = "|"), ")\\b")
grep(StatePat, TestData, value = TRUE)
[1] "Indianapolis, IN"  "Columbia, TN"      "Chicago, IL"       "Holeintheroad, TN"
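
If you'd rather keep your original sprintf() style, the same fix applies: put the \\b markers inside the format string around %s. A minimal sketch (the `locs` vector here is just sample data standing in for your user_info$location column):

pat <- sprintf("\\b(%s)\\b", paste(datasets::state.abb, collapse = "|"))

locs <- c("Indianapolis, IN", "MAGA", "Chicago, IL")
grepl(pat, locs)   # only whole-word abbreviations match; "MAGA" does not

That pattern drops straight into your str_match() pipeline in place of the sprintf("(%s)", ...) call.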