Create Frequency table using R and Term document Matrix

Question

I have created the following dataframe consisting of a few e-mail subject lines.

 df <- data.frame(subject=c('Free ! Free! Free ! Clear Cover with New Phone',
                            'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'))

I have created a list of frequent words derived from the above dataframe. I have added these keywords to the dataframe as columns and dummy coded them as 0:

 most_freq_words <- c('Free', 'New', 'Limited', 'Offer')



Subject                                               Free New Limited Offer                                                    

 'Free Free Free! Clear Cover with New Phone',          0   0     0      0
 'Offer ! Buy New phone and get earphone at             0   0     0      0
 1000. Limited Offer!'

I want to obtain a frequency count of the words in each e-mail subject. The output should be as follows:

  Subject                                             Free New Limited Offer                                                    

 'Free Free Free!  Clear Cover with New Phone',         3   1     0      0
 'Offer ! Buy New phone and get earphone at             0   1     1      2
 1000. Limited Offer!'

I have tried the following code

for (i in 1:length(most_freq_words)) {
  df[[most_freq_words[i]]] <- as.numeric(grepl(tolower(most_freq_words[i]),
                                               tolower(df$subject)))
}

This, however, only tells whether the word is present in the sentence, not how many times it occurs. I need the output given above. Could someone help me?

jazzurro · Answer 1 · 2018-02-16T07:16:54.823

I handled this task with the tidytext package. First, I added a grouping variable to the data set. Then, I split the subjects into words using unnest_tokens(). I removed all words except those in most_freq_words. Then, I counted how many times each word appeared in each sentence. Finally, I converted the long-format data to wide format with spread() from tidyr. If you still want the original sentence, you can add it to the output easily (e.g., adding cbind(subject = df$subject) after the spread() line).

library(dplyr)
library(tidyr)     # provides spread()
library(tidytext)

df <- data.frame(subject=c('Free ! Free! Free ! Clear Cover with New Phone',
                           'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
                 stringsAsFactors = FALSE)

most_freq_words <- c('Free', 'New', 'Limited', 'Offer')

mutate(df, group = 1:n()) %>%
  unnest_tokens(input = subject, output = word, token = "words", to_lower = FALSE) %>%
  filter(word %in% most_freq_words) %>%
  count(group, word) %>%
  spread(key = word, value = n, fill = 0)

  group  Free Limited   New Offer
  <int> <dbl>   <dbl> <dbl> <dbl>
1     1  3.00    0     1.00  0   
2     2  0       1.00  1.00  2.00

akrun · Accepted Answer · 2018-02-16T07:13:31.963

Here is another option with tidyverse. We use map to loop over 'most_freq_words', count the occurrences of each word in the 'subject' column of 'df' with str_count, convert the result to a tibble, name its column after the word, and bind the columns to the original dataset 'df'.

library(tidyverse)
most_freq_words %>% 
      map(~ str_count(df$subject, .x) %>%
                    as_tibble %>% 
                    set_names(.x)) %>% 
      bind_cols(df, .)
#                                                         subject Free New Limited Offer
#1                 Free ! Free! Free ! Clear Cover with New Phone    3   1       0     0
#2 Offer ! Buy New phone and get earphone at 1000. Limited Offer!    0   1       1     2
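One thing to keep in mind: str_count treats each keyword as a case-sensitive regular expression and counts it anywhere it appears, including inside longer words. The following is only a sketch of a variant, assuming whole-word, case-insensitive counts are wanted (the tolower() calls in the question suggest so, but that is an assumption). The names counts and result are made up for illustration.

```r
library(stringr)

df <- data.frame(
  subject = c('Free ! Free! Free ! Clear Cover with New Phone',
              'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
  stringsAsFactors = FALSE
)
most_freq_words <- c('Free', 'New', 'Limited', 'Offer')

# For each keyword, count whole-word matches in every subject, ignoring case.
# \\b marks a word boundary, so 'New' would not be counted inside 'Newest'.
counts <- sapply(most_freq_words, function(w) {
  str_count(df$subject, regex(paste0("\\b", w, "\\b"), ignore_case = TRUE))
})

# counts is a matrix with one column per keyword; attach it to the original data.
result <- cbind(df, counts)
result
```

For the two subjects above this gives the same counts as the original answer; the difference only shows up when a keyword occurs in another case or as part of a longer word.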

MKR · Answer 3 · 2018-02-16T07:12:16.077


Replace grepl with gregexpr and then take the length of the first list item it returns. The for-loop also has to run over every row of df. Keeping the OP's for-loop approach, the modified code looks like this:

for (i in 1:length(most_freq_words)){
  for(j in 1:nrow(df)){
    df[j,most_freq_words[i]] <- ifelse(gregexpr(tolower(most_freq_words[i]),
       tolower(df$subject[j]))[[1]][[1]] >0,
    length(gregexpr(tolower(most_freq_words[i]), tolower(df$subject[j]))[[1]]), 0)
  }
}  


> df
                                                         subject Free New Limited Offer
1                Free ! Free! Free ! Clear Cover with New  Phone    3   1       0     0
2 Offer ! Buy New phone and get earphone at 1000. Limited Offer!    0   1       1     2

edited Feb 16 '18 at 07:12

answered Feb 16 '18 at 06:58


This works neatly. Thank you. Could you add an explanation? I am a bit vague as to how it works. – Raghavan vmvs Feb 16 '18 at 08:57
@Raghavanvmvs `gregexpr` returns a list with one element per input string, so `[[1]]` is the vector of match positions for the text. The first value of that vector tells whether the match was successful: if it is `-1`, the word could not be found in the text. Otherwise the vector holds the start index of every match, hence the `length` of that vector gives you the number of occurrences. – MKR Feb 16 '18 at 10:50
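To make the explanation above concrete, here is a small base R sketch of what `gregexpr` returns for a hit and for a miss. The helper count_word is a hypothetical name introduced only for this illustration, and it uses fixed = TRUE on the assumption that the keywords are plain strings rather than regular expressions.

```r
subject <- 'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'

# gregexpr() returns a list with one element per input string;
# [[1]] is the vector of match start positions for our single string.
hits <- gregexpr("offer", tolower(subject))[[1]]
length(hits)        # one entry per occurrence of "offer"

# When nothing matches, the vector has a single element equal to -1.
misses <- gregexpr("free", tolower(subject))[[1]]
misses[1] == -1

# Hypothetical helper wrapping the same logic as the nested loop:
# 0 when the first position is -1, otherwise the number of positions.
count_word <- function(word, text) {
  pos <- gregexpr(tolower(word), tolower(text), fixed = TRUE)[[1]]
  if (pos[1] == -1) 0L else length(pos)
}

count_word("Offer", subject)
count_word("Free", subject)
```

The explicit check for `-1` is needed because a failed match still returns a vector of length one, so `length` alone would report one occurrence instead of zero.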
