Is there a way to find the most frequently used words in a column of strings in a data frame in R? I came across lots of functions that do this for a text corpus, but none for a data frame. I need to do it on a data frame so that I can create "metadata" for the products. Below is an example of the data I have and the result I am trying to achieve. Any help is highly appreciated. Thanks!

Product data for a grocer:

[image of the product data table]

Now I want to find the most frequently occurring word in the "combineall" column for each row and list it in a new column next to it. Basically, I am trying to create metadata from the product description. Thanks again!

  • Please don't add data as images; add it using `dput` instead. Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). Have you tried this? https://stackoverflow.com/questions/37291984/find-the-most-frequently-occuring-words-in-a-text-in-r – Ronak Shah Jul 27 '20 at 04:11
  • @RonakShah thank you for your response. I will try to repost the question so that the data is not an image and can be used. Also the question you shared didn't solve my problem. I want to maintain the same structure so that the most frequent word from each row is always linked to the UPC (product code). Hopefully this will provide some clarity. Thanks again! – ABHIK ARORA Jul 27 '20 at 04:21
  • Perhaps add only a few rows with `dput(head(df))` and show the expected output. You can delete the answer below since it is not actually an answer. – Ronak Shah Jul 27 '20 at 04:25
  • Also give an example of the expected output, because the text in combineall often has only 1 occurrence of each word. Counting the words then returns almost the whole text in combineall again (minus stopwords). You might be better off checking which words (or ngrams) occur per department or category (major or sub). – phiver Jul 27 '20 at 08:58

2 Answers

If you use stringr, I can think of this as a two-step process.

The first step is to extract the information from the column "combineall" like so:

library(stringr)  # also makes the %>% pipe available
DF2 <- DF %>% stringr::str_glue_data("{rownames(.)} combineall: {combineall}")

A Base R alternative would be

do.call(sprintf, c(fmt = "combineall: %s", DF))

Then you could use the following simple function to count the words:

# function to count words in a character vector of strings
countwords <- function(strings) {

  # replace line breaks with spaces
  wn <- gsub(pattern = "\n", replacement = " ", x = strings)

  # remove punctuation
  ws <- gsub(pattern = "[[:punct:]]", replacement = "", x = wn)

  # split on runs of spaces and flatten into a single vector of words;
  # without unlist(), table() fails when there is more than one string
  wsp <- unlist(strsplit(ws, " +"))

  # tabulate the words, most frequent first
  wst <- as.data.frame(sort(table(wsp, exclude = ""), decreasing = TRUE))
  wst
}
countwords(DF2)

Then add the most frequent words back into your data. Hope this is what you wanted and it helped you.
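
A minimal sketch of that last step, assuming the data frame has a product-code column named UPC next to combineall. The sample rows and the helper name top_word() are made up for illustration; they are not from the question. The helper returns the most frequent word of a single string, and ties are resolved by however sort() happens to order equal counts.

```r
# Hypothetical sample data; UPC values and descriptions are invented
DF <- data.frame(
  UPC = c("0001", "0002", "0003"),
  combineall = c("whole milk milk 1 gal", "brown eggs large eggs", "wheat bread"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent word in one string
top_word <- function(s) {
  # drop punctuation, lower-case, and split on runs of spaces
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", s)), " +"))
  words <- words[words != ""]
  if (length(words) == 0) return(NA_character_)
  names(sort(table(words), decreasing = TRUE))[1]
}

# One result per row, so each word stays linked to its UPC
DF$top_word <- vapply(DF$combineall, top_word, character(1), USE.NAMES = FALSE)
DF
```

Because the helper is applied row by row, the structure of the data frame is unchanged and only one column is added. As phiver's comment points out, many descriptions contain each word only once, in which case the "top" word of a row is essentially arbitrary.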

GSA

Sample data:

df <- data.frame(
  combineall = c("some words", "more of the same", "again words", "different items", "and more and more"),
  stringsAsFactors = FALSE  # keep strings as character in R < 4.0 so strsplit works
)

Make a frequency table:

freqtable <- sort(table(unlist(strsplit(df$combineall, " "))), decreasing = T)

Select the top 3 most frequent words and define them as an alternation pattern:

top3 <- paste0("(", paste0("\\b", names(freqtable)[1:3], "\\b", collapse = "|"), ")")

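Now apply the function grep (with the argument value = TRUE) to each split string, to match the top 3 most frequent words in the column combineall. Using vapply rather than lapply makes the new column a plain character vector instead of a list:

df$top3 <- vapply(strsplit(df$combineall, " "),
                  function(x) paste0(grep(top3, x, value = TRUE), collapse = ","),
                  character(1))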

Result:

The dataframe now lists which of the top3 items occur in each string in combineall:

df
         combineall              top3
1        some words             words
2  more of the same              more
3       again words             words
4   different items                  
5 and more and more and,more,and,more
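
Row 5 repeats its matches because grep returns every matching token. If each word should be listed only once per row, wrapping the matches in unique() is one option. The sketch below reuses the sample data from above and swaps the regular expression for a plain %in% lookup, which is a simplification and not part of the approach shown so far; the names top_words and top3_unique are introduced here for illustration.

```r
df <- data.frame(
  combineall = c("some words", "more of the same", "again words",
                 "different items", "and more and more"),
  stringsAsFactors = FALSE
)

# Frequency table over all words in the column
freqtable <- sort(table(unlist(strsplit(df$combineall, " "))), decreasing = TRUE)
top_words <- names(freqtable)[1:3]

# Keep each top word at most once per row, in order of first appearance
df$top3_unique <- vapply(
  strsplit(df$combineall, " "),
  function(x) paste(unique(x[x %in% top_words]), collapse = ","),
  character(1)
)
df$top3_unique
```

With this sample data, the last row becomes "and,more" instead of "and,more,and,more"; the other rows are unchanged.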
Chris Ruehlemann
  • I've edited the solution to provide the sought new column which records which of the top most frequent words occur in each string. Does this help? – Chris Ruehlemann Jul 27 '20 at 13:25