I am doing some text analysis on free-text data with tidytext. Consider these sample sentences:
"The quick brown fox jumps over the lazy dog"
"I love books"
My tokenization approach using tidytext:
library(tidytext)
library(dplyr)

unigrams <- tweet_text %>%
  unnest_tokens(output = word, input = txt) %>%
  anti_join(stop_words, by = "word")
This produces a one-token-per-row data frame (shown here before the stop word removal, for readability):
The
quick
brown
fox
jumps
over
the
lazy
dog
I now need to join every unigram back to its original sentence:
"The quick brown fox jumps over the lazy dog" | The
"The quick brown fox jumps over the lazy dog" | quick
"The quick brown fox jumps over the lazy dog" | brown
"The quick brown fox jumps over the lazy dog" | fox
"The quick brown fox jumps over the lazy dog" | jumps
"The quick brown fox jumps over the lazy dog" | over
"The quick brown fox jumps over the lazy dog" | the
"The quick brown fox jumps over the lazy dog" | lazy
"The quick brown fox jumps over the lazy dog" | dog
"I love books" | I
"I love books" | love
"I love books | books
I'm a bit stuck. The solution needs to scale to thousands of sentences. I thought something like this might be built into tidytext, but I haven't found anything yet.
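For reference, a rough workaround I've sketched is to copy the sentence into a second column before tokenizing, so the full text survives unnest_tokens() (which drops its input column by default). The sentence column name here is just my placeholder, and I'm not sure this is the idiomatic approach:

library(dplyr)
library(tidytext)

unigrams <- tweet_text %>%
  mutate(sentence = txt) %>%                      # keep a copy of the full sentence
  unnest_tokens(output = word, input = txt) %>%   # txt is dropped, sentence is repeated per token
  anti_join(stop_words, by = "word") %>%
  select(sentence, word)

I also noticed that unnest_tokens() has a drop argument, so passing drop = FALSE might avoid the mutate() step, but I'd still like to know whether there's a cleaner, native tidytext way to do this.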