
I have a data frame with a column of company names. I want to create a new column that holds a fuzzy/canonicalized version of each name (perhaps using regex to strip suffixes like "corporation", "inc", and "llc" and prefixes like "the").

name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name)

I want companies$canonicalized_name to return

"microsoft", "apple", "youtube", "huffington post"

How can I write this regex pattern in R?

jisoo shin
  • It would be a lot more intuitive to have the intersection first and then using `agrep` to find the closest match to the names in list_1 and list_2. e.g. `lookup <- c("microsoft", "apple", "youtube", "huffington post"); lapply(lookup, agrep, c(list_1, list_2), value=T)` – Adam Quek May 02 '17 at 03:00
  • a. Those are vectors, not lists; b. What have you tried so far?; c. `adist` is one starting point. – alistaire May 02 '17 at 03:05

1 Answer


I don't know what rules should apply to normalize your data, but if you just want to delete everything following a comma and then convert the string to lower case (as in your example), you can do this with stringr::str_replace and tolower:

library(dplyr)
library(stringr)
name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name) %>%
  # Remove the first comma and everything after it, then lower-case the result
  dplyr::mutate(canonicalized_name = tolower(stringr::str_replace(name, ",.*", "")))

companies
#              name canonicalized_name
# 1       Microsoft          microsoft
# 2     Apple, Inc.              apple
# 3    Youtube, LLC            youtube
# 4 Huffington Post    huffington post
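
If the suffix is not always separated by a comma, one option is to strip a leading "the" and a fixed list of legal suffixes explicitly. The following is only a sketch in base R: the helper name canonicalize_name and the suffix list are assumptions made for illustration, and real data will likely need more entries.

```r
# Hypothetical helper: lower-case a name, drop a leading "the",
# and drop one trailing legal suffix such as "inc" or "llc".
canonicalize_name <- function(x) {
  x <- tolower(x)
  # Remove a leading "the" followed by whitespace
  x <- sub("^the\\s+", "", x, perl = TRUE)
  # Remove a trailing suffix preceded by a comma and/or whitespace,
  # with an optional final period (the suffix list is an assumption)
  suffixes <- "incorporated|corporation|company|inc|corp|llc|ltd|co"
  pattern <- paste0("[,\\s]+(", suffixes, ")\\.?$")
  x <- sub(pattern, "", x, perl = TRUE)
  trimws(x)
}

name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name, stringsAsFactors = FALSE)
companies$canonicalized_name <- canonicalize_name(companies$name)
```

Because the pattern requires a comma or whitespace before the suffix, names that merely end in the same letters (for example "Costco") are left alone.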
ikop