How do I use regex in R to create a new column of canonicalized company names?

Question

I have a dataframe with a column of company names. I want to create a new column that is a fuzzy/canonicalized version of the name (perhaps using regex to strip suffixes like "corporation", "inc", and "llc" and prefixes like "the").

name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name)

I want companies$canonicalized_name to return

"microsoft", "apple", "youtube", "huffington post"

How can I write this regex pattern in R?

It would be a lot more intuitive to have the intersection first and then using `agrep` to find the closest match to the names in list_1 and list_2. e.g. `lookup <- c("microsoft", "apple", "youtube", "huffington post"); lapply(lookup, agrep, c(list_1, list_2), value=T)` — Adam Quek, May 02 '17 at 03:00
a. Those are vectors, not lists; b. What have you tried so far?; c. `adist` is one starting point. — alistaire, May 02 '17 at 03:05

score 1 · Accepted Answer · answered May 02 '17 at 06:54

I don't know which rules should apply to normalize your data, but if you just want to delete everything following a comma and then convert the string to lower case (as in your example), you can do this, for example, with:

library(dplyr)
library(stringr)
name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name) %>%
        dplyr::mutate(canonicalized_name = stringr::str_replace(name, ",.*", "") %>% tolower)

companies
#              name canonicalized_name
# 1       Microsoft          microsoft
# 2     Apple, Inc.              apple
# 3    Youtube, LLC            youtube
# 4 Huffington Post    huffington post
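
If you also need to strip suffixes that are not preceded by a comma, or a leading "the", a rough sketch in base R could look like the following. The suffix list, the helper name canonicalize, and the extra row "The Example Corporation" are assumptions made for illustration; extend the alternation to match your actual data.

```r
name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post",
          "The Example Corporation")  # last entry is a made-up test case
companies <- data.frame(name, stringsAsFactors = FALSE)

# Hypothetical helper: lower-case, drop a leading "the", drop a trailing suffix
canonicalize <- function(x) {
  x <- tolower(x)
  # remove "the" only at the start of the string
  x <- sub("^the\\s+", "", x)
  # remove a trailing suffix, with an optional comma before it and an
  # optional period after it; \\s+ keeps names like "zinc" intact
  x <- sub(",?\\s+(corporation|corp|inc|llc)\\.?$", "", x)
  trimws(x)
}

companies$canonicalized_name <- canonicalize(companies$name)
companies$canonicalized_name
# [1] "microsoft"       "apple"           "youtube"         "huffington post"
# [5] "example"
```

Anchoring the pattern with $ means a suffix is only removed at the end of the name, so a company with "inc" in the middle of its name is left alone.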
