
I have a data.table with company names and address information. I want to remove legal entities and the most common words from the company names, so I wrote a function and apply it to my data.table.

library(data.table)
library(stringr)

search_for_default <- c("inc", "corp", "co", "llc", "se", "\\&", "holding", "professionals",
                        "services", "international", "consulting", "the", "for")

clean_strings <- function(string, search_for = search_for_default){
  clean_step1 <- str_squish(str_replace_all(string, "[:punct:]", " "))  # remove punctuation
  clean_step2 <- unlist(str_split(tolower(clean_step1), " "))           # split into tokens
  clean_step2 <- clean_step2[!str_detect(clean_step2, "^american|^canadian")]  # clean up geographical names
  res <- str_squish(str_c(clean_step2[!clean_step2 %in% search_for], collapse = " "))  # remove legal entities and common words
  res <- paste(unique(unlist(str_split(res, " "))), collapse = " ")  # deduplicate tokens and paste the string back together
  return(res)
}

datatable[, COMPANY_NAME_clean:=clean_strings(COMPANY_NAME), by=COMPANY_NAME]
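For context, because the update is grouped by COMPANY_NAME, clean_strings effectively runs once per distinct name. A purely illustrative, equivalent two-step version (the intermediate table name lookup_dt is hypothetical) would be:

lookup_dt <- unique(datatable[, .(COMPANY_NAME)])                                   # one row per distinct name
lookup_dt[, COMPANY_NAME_clean := clean_strings(COMPANY_NAME), by = COMPANY_NAME]   # clean each distinct name once
datatable[lookup_dt, COMPANY_NAME_clean := i.COMPANY_NAME_clean, on = "COMPANY_NAME"]  # join cleaned values back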

The script works well, but when I have a large dataset (>3b rows) it takes a very long time. Is there a more efficient way of doing this?

Examples:

Input:

Company_Name <- c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.", "American Test Company for Consulting")

Expected:

Company_name_clean <- c("walmart", "amazon.com", "apple", "test company")
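
A minimal way to try this on the example names, assuming they are loaded into a data.table with the column name COMPANY_NAME used in the call above (note that, as written, the punctuation-removal step also turns "Amazon.com" into "amazon com", so the second cleaned value differs slightly from the expected output):

datatable <- data.table(COMPANY_NAME = c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.",
                                         "American Test Company for Consulting"))
datatable[, COMPANY_NAME_clean := clean_strings(COMPANY_NAME), by = COMPANY_NAME]
datatable$COMPANY_NAME_clean
#> "walmart" "amazon com" "apple" "test company"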