ngrams analysis in tidytext in R

Question

I am trying to do n-gram analysis in tidytext on a corpus of 770 speeches. However, the unnest_tokens function in tidytext takes a data frame as input. In the example I checked (the Jane Austen books), each line of a book is stored as a row in a data frame. I am not able to convert my corpus into a data frame, neither for one speech at a time nor for the whole corpus at once.

How can I run n-gram analysis (n = 2, 3, etc.) with tidytext's unnest_tokens on my corpus? Can someone please suggest an approach?

Thanks

Please create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and expected output. But to me it sounds as if you just need to use quanteda instead of tidytext. — phiver, Feb 14 '20 at 09:08

score 0 · Answer 1 · answered Feb 14 '20 at 06:35

You can use the ngram and tm libraries for this. Replace "myCorpus" with the corpus you created.

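library(tm)     # provides scan_tokenizer, used further down
library(ngram)  # provides ngram() and get.phrasetable()

myCorpus <- c("Hi How are you", "Hello World", "I love Stackoverflow", "Good Bye All")
ng <- ngram(myCorpus, n = 2)  # build bigrams from the character vector
get.phrasetable(ng)           # table of bigrams with their frequencies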

If you want to tokenize your corpus and convert it into a data frame, use the code below.

tokenizedCorpus <- lapply(myCorpus, scan_tokenizer)
mydata <- data.frame(text = sapply(tokenizedCorpus, paste, collapse = " "), stringsAsFactors = FALSE)

Sarvagna Mahakali


Hi, I knew about this, but it is not useful in my case because I do not want to treat the whole corpus as one text. As I mentioned, I have 770 speeches and I want to work on each one separately. I also want to use tidytext, because there are other features of it that I want to use beyond n-grams, so I am looking for help with the tidytext n-gram feature. – jalaj pathak Feb 14 '20 at 08:39

If you want to do it separately, you can try using a for loop to iterate over the speeches and create a separate corpus and n-grams for each one. I will look into finding a solution for n-grams in the tidytext package. – Sarvagna Mahakali Feb 14 '20 at 12:06

score 0 · Answer 2 · answered Feb 16 '20 at 00:36

You say that you have a "corpus" of 770 speeches. Do you mean you have a character vector? If so, you can tokenize your text in this way:

library(tidyverse)
library(tidytext)

speech_vec <- c("I am giving a speech!",
                "My second speech is even better.",
                "Unfortunately, this speech is terrible!",
                "For my final speech, I will wow you all.")

speech_df <- tibble(text = speech_vec) %>%
  mutate(speech = row_number())

tidy_speeches <- speech_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

tidy_speeches
#> # A tibble: 21 x 2
#>    speech bigram            
#>     <int> <chr>             
#>  1      1 i am              
#>  2      1 am giving         
#>  3      1 giving a          
#>  4      1 a speech          
#>  5      2 my second         
#>  6      2 second speech     
#>  7      2 speech is         
#>  8      2 is even           
#>  9      2 even better       
#> 10      3 unfortunately this
#> # … with 11 more rows

Created on 2020-02-15 by the reprex package (v0.3.0)
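The question also asks about n = 3 and beyond. A minimal sketch of how the same call might be reused for trigrams, followed by a per-speech count; the two-speech `speech_df` here is a shortened stand-in for the example above, and the names `trigrams` and `trigram_counts` are only illustrative:

```r
library(dplyr)
library(tidytext)

# Shortened stand-in for the speech_df built above (assumption: one row per speech)
speech_df <- tibble(
  speech = 1:2,
  text = c("I am giving a speech!",
           "My second speech is even better.")
)

# Same call as for bigrams, with n = 3; trigrams never cross speech boundaries
# because each row of speech_df is tokenized on its own
trigrams <- speech_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

# Count how often each trigram occurs within each speech
trigram_counts <- trigrams %>%
  count(speech, trigram, sort = TRUE)

trigram_counts
```

Because the speech number stays attached to every n-gram, each of the 770 speeches can be analysed separately with `group_by(speech)` or `filter(speech == ...)` rather than with a loop.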

If instead, you mean that you have a DocumentTermMatrix from the tm package, check out this chapter for details on how to convert to a tidy data structure.
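If the speeches are held in a tm corpus object rather than a DocumentTermMatrix, one possible route is tidytext's `tidy()` method for corpora, which returns one row per document with a `text` column. The sketch below assumes a `VCorpus` built from a character vector; the name `speech_corpus` and the two sample texts are made up for illustration, and a real corpus may carry different metadata columns:

```r
library(dplyr)
library(tidytext)
library(tm)

# Hypothetical stand-in for the real corpus of speeches
speech_corpus <- VCorpus(VectorSource(c(
  "I am giving a speech!",
  "My second speech is even better."
)))

# tidy() turns the corpus into a data frame: one row per document,
# the document text in `text`, plus per-document metadata such as `id`
speech_df <- tidy(speech_corpus) %>%
  select(id, text)

# From here the data frame can go straight into unnest_tokens()
bigrams <- speech_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams
```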

Hi, I figured this out from the link a couple of days ago. Thanks anyway. — jalaj pathak, Feb 17 '20 at 06:01
