I have a data.table with company names and address information. I want to remove legal entity suffixes and the most common words from the company names, so I wrote a function and apply it to my data.table:
search_for_default <- c("inc", "corp", "co", "llc", "se", "\\&", "holding", "professionals",
"services", "international", "consulting", "the", "for")
clean_strings <- function(string, search_for=search_for_default){
clean_step1 <- str_squish(str_replace_all(string, "[:punct:]", " ")) #remove punctation
clean_step2 <- unlist(str_split(tolower(clean_step1), " ")) #split in tokens
clean_step2 <- clean_step2[!str_detect(clean_step2, "^american|^canadian")] # clean up geographical names
res <- str_squish(str_c(clean_step2[!clean_step2 %in% search_for], sep="", collapse=" ")) #remove legal entities and common words
res <- paste(unique(unlist(str_split(res, " "))), collapse=" ") # paste string together
return(res) }
datatable[, COMPANY_NAME_clean:=clean_strings(COMPANY_NAME), by=COMPANY_NAME]
The script works well, but on a large dataset (>3 billion rows) it takes very long. Is there a more efficient way of doing this?
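For illustration, one direction might be to collapse the stop words into a single regex and clean the whole column in one vectorised pass instead of calling the function once per group. The sketch below is only a rough, untested idea: it assumes the same column and stop-word list as above, skips the duplicate-token step, and leaves out the "&" entry because punctuation is already replaced with spaces before matching.

# Hypothetical vectorised sketch, not the function above: one combined regex
# applied to the whole column at once.
stop_words <- c("inc", "corp", "co", "llc", "se", "holding", "professionals",
                "services", "international", "consulting", "the", "for")
stop_regex <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b",
                     "|\\b(american|canadian)\\w*")

datatable[, COMPANY_NAME_clean := str_squish(
  str_remove_all(tolower(str_replace_all(COMPANY_NAME, "[:punct:]", " ")), stop_regex)
)]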
Examples:
Input:
Company_Name <- c("Walmart Inc.", "Amazon.com, Inc.", "Apple Inc.", "American Test Company for Consulting")
Expected:
Company_name_clean <- c("walmart", "amazon.com", "apple", "test company")