Very new to R and coding, and trying to do a frequency analysis on a long list of sentences and their given weighting. I've un-nested and mutated the data, but when I try to remove stop words, the sort order of words within each sentence gets randomized. I need to create bigrams later on, and would prefer if they're based on the original phrase.

Here's the relevant code, can provide more if insufficient:

library(dplyr)
library(tidytext)

data = data %>%
  anti_join(stop_words) %>%
  filter(!is.na(word))

What can I do to retain the original sort order within each sentence? I have all the words in a sentence indexed so I can match them to their given weight. Is there a better way to remove stop words that doesn't mess up the sort order?

Saw a similar question here but it's unresolved: How to stop anti_join from reversing sort order in R?

Also tried this but didn't work: dplyr How to sort groups within sorted groups?

Got help from a colleague in writing this but unfortunately they're not available anymore so any insight will be helpful. Thanks!

1 Answer

You could add a sort index to your data before the anti_join, then re-sort by that index afterwards:

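library(dplyr)
library(tidytext)

data = data %>%
  dplyr::mutate(idx = dplyr::row_number()) %>%    # record the original row order
  dplyr::anti_join(stop_words, by = "word") %>%   # drop stop words
  dplyr::filter(!is.na(word)) %>%
  dplyr::arrange(idx)                             # restore the original order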

(The dplyr:: prefix is not necessary, but it helps you remember which package each function comes from.)
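
If the index has to track each word's position within its sentence rather than the row as a whole, it can be built per group right after tokenizing. The following is only a minimal sketch on toy data: the data frame `sentences` and the columns `sentence_id`, `weight`, `text` and `word_idx` are assumptions made for illustration, not taken from the question's data.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one row per sentence, with its weight
sentences <- tibble(
  sentence_id = 1:2,
  weight      = c(0.7, 0.3),
  text        = c("The quick brown fox jumps over the lazy dog",
                  "Make America Great Again")
)

# One row per word; word_idx is the word's position within its sentence
words <- sentences %>%
  unnest_tokens(word, text) %>%
  group_by(sentence_id) %>%
  mutate(word_idx = row_number()) %>%
  ungroup()

# Remove stop words, then restore the original order explicitly
cleaned <- words %>%
  anti_join(stop_words, by = "word") %>%
  filter(!is.na(word)) %>%
  arrange(sentence_id, word_idx)
```

Sorting by both `sentence_id` and `word_idx` at the end means the result does not depend on whatever row order the join happens to return, and the `weight` column stays attached to every word.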

Stefan F
  • Tried this, but my index is for the entire phrase rather than each word within it, so the sentence itself still gets scrambled. I want the bigrams to be created from the original phrase. For example, taking Trump's tweet "Make America Great Again", rather than returning "Make America", "America Great" and "Great Again", my code returns bigrams like "Make Great" :( – shwarmashubs Aug 15 '17 at 16:42
  • Could you post an example of what your data looks like? A reproducible example would be best, so that we can play around with it. – Stefan F Aug 15 '17 at 17:38
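
One possible way to keep bigrams tied to the original phrase is to build them from the raw text first and only then drop the pairs that contain a stop word. This is a sketch under assumptions: it presumes the untokenized phrases are still available in a column (here hypothetically called `text`), and the names `sentences`, `all_bigrams` and `kept` are made up for the example.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the real phrases
sentences <- tibble(
  sentence_id = 1L,
  weight      = 1,
  text        = "Make America Great Again"
)

# Bigrams are formed from adjacent words in the original phrase
all_bigrams <- sentences %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%                   # phrases with a single word yield NA
  mutate(
    word1 = sub(" .*$", "", bigram),           # first word of the pair
    word2 = sub("^.* ", "", bigram)            # second word of the pair
  )

# Drop any bigram in which either word is a stop word
kept <- all_bigrams %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)
```

Because stop words are filtered after the pairs are formed, a bigram can never join two words that were not adjacent in the original phrase; the trade-off is that a pair is discarded entirely when either word is a stop word.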