
I have a dataframe which looks like this:

sentences <- data.frame(sentences = 
                          c('You can apply for or renew your Medical Assistance benefits online by using COMPASS.',
                            'COMPASS is the name of the website where you can apply for Medical Assistance and many other services that can help you make ends meet.',
                          'Medical tourism refers to people traveling to a country other than their own to obtain medical treatment. In the past this usually referred to those who traveled from less-developed countries to major medical centers in highly developed countries for treatment unavailable at home.',
                          'Health tourism is a wider term for travel that focus on medical treatments and the use of healthcare services. It covers a wide field of health-oriented, tourism ranging from preventive and health-conductive treatment to rehabilitational and curative forms of travel.',
                          'Medical tourism carries some risks that locally provided medical care either does not carry or carries to a much lesser degree.',
                          'Receiving medical care abroad may subject medical tourists to unfamiliar legal issues. The limited nature of litigation in various countries is a reason for accessbility of care overseas.', 
                          'While some countries currently presenting themselves as attractive medical tourism destinations provide some form of legal remedies for medical malpractice, these legal avenues may be unappealing to the medical tourist.'))

All I want to do is find the important words in each row and create a new column that should look like this:

sentences$ImpWords <- c("apply, renew, Medical, Assistance, benefits, online, COMPASS",
                    "COMPASS, name, website, apply, Medical, Assistance, services, help, meet") 

and so forth

I am not sure how this can be done.

I was trying bag-of-words cleaning and preprocessing using various packages such as tm and tidytext, but I was unable to get the desired result.

Is there an alternative approach?

LeMarque
    You can use the udpipe R package (https://cran.r-project.org/package=udpipe). You can use that package to annotate your text data and next simply extract the relevant words corresponding to the part of speech tags you like (see e.g. https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html) –  Aug 06 '18 at 15:53
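The udpipe suggestion in the comment above could be sketched roughly as follows. This assumes a one-time download of the English model and treats nouns, proper nouns, verbs, and adjectives as the "important" words; the POS categories kept here are an illustrative choice, not part of the original suggestion:

```r
library(udpipe)

# one-time model download (~16 MB), then load it
m <- udpipe_download_model(language = "english")
model <- udpipe_load_model(file = m$file_model)

# annotate each sentence with part-of-speech tags
ann <- as.data.frame(udpipe_annotate(model,
                                     x = as.character(sentences$sentences),
                                     doc_id = as.character(seq_len(nrow(sentences)))))

# keep content words only and collapse them per sentence
keep <- subset(ann, upos %in% c("NOUN", "PROPN", "VERB", "ADJ"))
aggregate(token ~ doc_id, data = keep, FUN = paste, collapse = ", ")
```

The annotation step is slower than simple stopword filtering, but it also gives you lemmas (the `lemma` column) for free, which addresses the lemmatization request in the comments below.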

2 Answers


This will achieve what you're after. If you want to remove more words, simply find a bigger/different list (many are available through different packages). Here I've used tm's English stopwords.

library(tm)
library(magrittr)  # provides the %>% pipe used below

stopwords <- stopwords('en')

sentences <- data.frame(sentences = 
                          c('You can apply for or renew your Medical Assistance benefits online by using COMPASS.',
                            'COMPASS is the name of the website where you can apply for Medical Assistance and many other services that can help you make ends meet.',
                            'Medical tourism refers to people traveling to a country other than their own to obtain medical treatment. In the past this usually referred to those who traveled from less-developed countries to major medical centers in highly developed countries for treatment unavailable at home.',
                            'Health tourism is a wider term for travel that focus on medical treatments and the use of healthcare services. It covers a wide field of health-oriented, tourism ranging from preventive and health-conductive treatment to rehabilitational and curative forms of travel.',
                            'Medical tourism carries some risks that locally provided medical care either does not carry or carries to a much lesser degree.',
                            'Receiving medical care abroad may subject medical tourists to unfamiliar legal issues. The limited nature of litigation in various countries is a reason for accessbility of care overseas.', 
                            'While some countries currently presenting themselves as attractive medical tourism destinations provide some form of legal remedies for medical malpractice, these legal avenues may be unappealing to the medical tourist.'))


# convert from factor (the pre-R 4.0 data.frame default) to character
sentences[,"sentences"] <- sentences[,"sentences"] %>% as.character()


ImpWords <- c()
for (i in 1:nrow(sentences)) {

  # split into words, keeping the original case for the output
  originalWords <- gsub('[[:punct:] ]+',' ',sentences[i, "sentences"]) %>% trimws(.) %>% strsplit(., " ") 
  # a lower-cased copy, used only for the stopword comparison
  lowerCaseWords <- gsub('[[:punct:] ]+',' ',tolower(sentences[i, "sentences"])) %>% trimws(.) %>% strsplit(., " ")
  # drop stopwords, then drop anything 3 characters or shorter
  wordsNotInStopWords <- originalWords[[1]][which(!lowerCaseWords[[1]] %in% stopwords)]
  wordsNotInStopWordsGreaterThanThreeChar <- wordsNotInStopWords[which(nchar(wordsNotInStopWords) > 3)]
  ImpWords[i] <- paste(wordsNotInStopWordsGreaterThanThreeChar, collapse = ", ")

}

sentences$ImpWords <- ImpWords
sentences$ImpWords
stevec
  • Awesome. One suggestion: you should include stemming and lemmatization for better filtering, and if possible include only the words that are longer than 3 characters. – LeMarque Jul 21 '18 at 18:28
    @I_m_LeMarque I added a line to exclude words of 3 or fewer characters. You could use stemming/lemmatization which can turn something like `c("walking", "walked", "walk")` into `c("walk", "walk", "walk")` if that's what you're looking to do – stevec Jul 21 '18 at 18:59
  • Good suggestion, but it will only work if the dataset is small; if the dataset is huge, the suggested approach will not scale. That is the approach I am looking for. Thanks for the help. – LeMarque Jul 21 '18 at 19:54
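Following up on the stemming suggestion in the comments: the SnowballC package (the stemmer tm uses under the hood for stemDocument()) exposes wordStem(), which collapses inflected forms to a common stem. A minimal sketch:

```r
library(SnowballC)

# Porter-stem each word; inflected forms collapse to a common stem
wordStem(c("walking", "walked", "walk"))
#> [1] "walk" "walk" "walk"
```

In the loop above, you could apply wordStem() to wordsNotInStopWords before collapsing, if stems rather than original word forms are acceptable in the output.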

Here is an approach using tidy data principles, if you'd like. One nice thing about this approach is that it is very flexible in its choice of stopword dictionary. You can switch them out via the argument to get_stopwords().

library(tidyverse)
library(tidytext)

sentences %>%
  mutate(line = row_number()) %>%
  unnest_tokens(word, sentences) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  nest(word) %>%
  mutate(words = map(data, unlist),
         words = map_chr(words, paste, collapse = " "))

#> Joining, by = "word"
#> # A tibble: 7 x 3
#>    line data           words                                              
#>   <int> <list>         <chr>                                              
#> 1     1 <tibble [7 × … apply renew medical assistance benefits online com…
#> 2     2 <tibble [9 × … compass website apply medical assistance services …
#> 3     3 <tibble [23 ×… medical tourism refers people traveling country ob…
#> 4     4 <tibble [25 ×… health tourism wider term travel focus medical tre…
#> 5     5 <tibble [12 ×… medical tourism carries risks locally provided med…
#> 6     6 <tibble [18 ×… receiving medical care abroad subject medical tour…
#> 7     7 <tibble [17 ×… countries presenting attractive medical tourism de…

Created on 2018-08-14 by the reprex package (v0.2.0).

The first line makes a column to keep track of each sentence, then the next line uses unnest_tokens() to tokenize the text and transform it to a tidy format. You can then remove stopwords via anti_join(). After that, the last couple of lines transform from the tidy data format (which, for what it's worth, already contains the info you are looking for, just in a different shape) to the data structure you asked about. You can remove the data column with select(-data) if you'd like.
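If you prefer to land directly on a plain character column (and sidestep nest(), whose interface changed in tidyr 1.0), a group_by()/summarise() variant of the same pipeline gets there in one pass. A sketch, assuming the sentences data frame from the question; note that unnest_tokens() lower-cases tokens by default:

```r
library(dplyr)
library(tidytext)

sentences %>%
  mutate(line = row_number(),
         sentences = as.character(sentences)) %>%
  unnest_tokens(word, sentences) %>%                        # one row per token
  anti_join(get_stopwords(source = "smart"), by = "word") %>%  # drop stopwords
  group_by(line) %>%
  summarise(ImpWords = paste(word, collapse = ", "))        # collapse per sentence
```

This produces one row per original sentence with the kept words joined by ", ", matching the format in the question (modulo case).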

Julia Silge