6

I have a dataset in R that lists out a bunch of company names and want to remove words like "Inc", "Company", "LLC", etc. for part of a clean-up effort. I have the following sample data:

sampleData

  Location             Company
1 New York, NY         XYZ Company
2 Chicago, IL          Consulting Firm LLC
3 Miami, FL            Smith & Co.

Words I do not want to include in my output:

stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")

I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.

removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}

removeWords(sampleData$Company,stopwords)

The output for the above function looks like this:

[1] "XYZ Company Consulting Firm Smith"

T he output should be:

 Location              Company
1 New York, NY         XYZ Company
2 Chicago, IL          Consulting Firm
3 Miami, FL            Smith

Any help would be appreciated.

Shannon
  • 63
  • 1
  • 1
  • 4

2 Answers2

9

We can use 'tm' package

library(tm)

stopwords = readLines('stopwords.txt')     #Your stop words file
x  = df$company        #Company column data
x  =  removeWords(x,stopwords)     #Remove stopwords

df$company_new <- x     #Add the list as new column and check
Hardik Gupta
  • 4,700
  • 9
  • 41
  • 83
6

With a little check on the stopwords( having inserted "\" in Co. to avoid regex, spaces ): (But the previous answer should be preferred if you dont want to keep an eye on stopwords)

 stopwords = c("Inc","inc","co ","Co ","Inc."," Co\\.","LLC","Corporation","Corp","&")

 gsub(paste0(stopwords,collapse = "|"),"", df$Company)
[1] "XYZ Company"      "Consulting Firm " "Smith "       

df$Company <- gsub(paste0(stopwords,collapse = "|"),"", df$Company)
# df
#      Location          Company
#1 New York, NY      XYZ Company
#2  Chicago, IL Consulting Firm 
#3    Miami, FL           Smith 
joel.wilson
  • 8,243
  • 5
  • 28
  • 48