I am trying to do n-gram analysis in tidytext on a corpus of 770 speeches. However, the unnest_tokens function in tidytext takes a data frame as input. In the example I checked (the Jane Austen books), each line of a book is stored as a row in a data frame. I have not been able to convert my corpus into a data frame, either one speech at a time or for the whole corpus at once.

How can I run n-gram analysis (n = 2, 3, etc.) with tidytext's unnest_tokens on my corpus? Can someone please suggest an approach?

Thanks

smci
jalaj pathak
  • Please create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and expected output. But to me it sounds as if you just need to use quanteda instead of tidytext. – phiver Feb 14 '20 at 09:08

2 Answers


You can use the ngram and tm packages for this. Replace "myCorpus" with the corpus you created.

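library(tm)
library(ngram)  # provides ngram() and get.phrasetable()

# Example corpus: a character vector with one document per element
myCorpus <- c("Hi How are you", "Hello World", "I love Stackoverflow", "Good Bye All")

# Build bigrams (n = 2) and list each one with its frequency
ng <- ngram(myCorpus, n = 2)
get.phrasetable(ng)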

If you want to tokenize your corpus and convert it into a data frame, use the code below.

tokenizedCorpus <- lapply(myCorpus, scan_tokenizer)
mydata <- data.frame(text = sapply(tokenizedCorpus, paste, collapse = " "), stringsAsFactors = FALSE)
  • Hi, I knew about that approach, but it is not useful in my case because I do not want to treat the whole corpus as one text. As I mentioned, I have 770 speeches and want to work on each one separately. I also want to use tidytext, since I plan to use its other features beyond n-grams, so I am looking for help with tidytext's n-gram functionality. – jalaj pathak Feb 14 '20 at 08:39
  • If you want to process each speech separately, you can use a for loop to create a separate corpus and n-grams for each speech. I will look into a solution for n-grams in the tidytext package. – Sarvagna Mahakali Feb 14 '20 at 12:06

You say that you have a "corpus" of 770 speeches. Do you mean you have a character vector? If so, you can tokenize your text in this way:

library(tidyverse)
library(tidytext)

speech_vec <- c("I am giving a speech!",
                "My second speech is even better.",
                "Unfortunately, this speech is terrible!",
                "For my final speech, I will wow you all.")

speech_df <- tibble(text = speech_vec) %>%
  mutate(speech = row_number())

tidy_speeches <- speech_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

tidy_speeches
#> # A tibble: 21 x 2
#>    speech bigram            
#>     <int> <chr>             
#>  1      1 i am              
#>  2      1 am giving         
#>  3      1 giving a          
#>  4      1 a speech          
#>  5      2 my second         
#>  6      2 second speech     
#>  7      2 speech is         
#>  8      2 is even           
#>  9      2 even better       
#> 10      3 unfortunately this
#> # … with 11 more rows

Created on 2020-02-15 by the reprex package (v0.3.0)
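
The same call handles other values of n, and because the speech column is carried along, each speech can still be analyzed on its own. Here is a minimal sketch of that idea, using a shortened two-speech version of the data above so the block runs by itself. The column name trigram and the per-speech count() step are illustrative choices, not part of the original answer.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Shortened example data: one row per speech, with an id column
speech_df <- tibble(
  speech = 1:2,
  text = c("I am giving a speech!",
           "My second speech is even better.")
)

# Trigrams: the same call as for bigrams, with n = 3
tidy_trigrams <- speech_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

# Count each trigram within each speech, so speeches stay separate
trigram_counts <- tidy_trigrams %>%
  count(speech, trigram, sort = TRUE)
```

With the real data, speech_df would have 770 rows, one per speech, and grouping or filtering on the speech column gives per-speech results without a loop.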

If instead, you mean that you have a DocumentTermMatrix from the tm package, check out this chapter for details on how to convert to a tidy data structure.
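
If the speeches are held in a tm corpus object rather than a DocumentTermMatrix, one possible route is tidytext's tidy() method for corpus objects, which returns one row per document with its text. The sketch below assumes a VCorpus built from a character vector; speech_corpus is a hypothetical stand-in for the real corpus, and the metadata columns returned may differ depending on how the corpus was created.

```r
library(tm)
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the real corpus of speeches
speech_corpus <- VCorpus(VectorSource(c("I am giving a speech!",
                                        "My second speech is even better.")))

# tidy() gives one row per document; keep only the document id and the text
speech_df <- tidy(speech_corpus) %>%
  select(id, text)

# Tokenize into bigrams, keeping the document id alongside each bigram
tidy_bigrams <- speech_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```

Note that a DocumentTermMatrix stores only per-document word counts, so word order is no longer available there; n-grams need to be built from the original text.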

Julia Silge