
I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm:

dfmatrix <- dfm(corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE, remove_numbers = FALSE)

I am trying to extract bigrams (e.g. "climate change", "global warming"), but when I run the following I keep getting an error message saying the ngrams argument is not used.

dfmatrix <- dfm(corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE, remove_numbers = FALSE, ngrams = 2)

I have installed the tokenizers, tidyverse, dplyr, ngram, readtext, quanteda and stm libraries. Below is a screenshot of my corpus. The doc_id column contains the article titles; I need the bigrams to be extracted from the "texts" column.

[screenshot of the corpus, with doc_id and texts columns]

Do I need to extract the ngrams from the corpus first or can I do it from the dfm? Am I missing some piece of code that allows me to extract the bigrams?

– katwag97

3 Answers


Strictly speaking, if ngrams are what you want, then you can use tokens_ngrams() to form them. But it sounds like you would rather get more interesting multi-word expressions than "of the" etc. For that, I would use textstat_collocations(). You will want to do this on tokens, not on a dfm - the dfm will have already split your tokens into bag-of-words features, from which ngrams or MWEs can no longer be formed.
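For the first route, here is a minimal sketch of forming bigram features with tokens_ngrams(), using the built-in inaugural corpus as a stand-in for your own corpus object:

library("quanteda")

## tokenise, then join each pair of adjacent tokens with "_"
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
bigrams <- tokens_ngrams(toks, n = 2)
dfmat_bigrams <- dfm(bigrams)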

Here's an example of the collocations approach from the built-in inaugural corpus. It removes stopwords but leaves a "pad" so that words that were not adjacent before the stopword removal will not appear as adjacent after their removal.

library("quanteda")
## Package version: 2.0.1

toks <- tokens(data_corpus_inaugural) %>%
  tokens_remove(stopwords("en"), padding = TRUE)

colls <- textstat_collocations(toks)
head(colls)
##          collocation count count_nested length   lambda        z
## 1      united states   157            0      2 7.893348 41.19480
## 2             let us    97            0      2 6.291169 36.15544
## 3    fellow citizens    78            0      2 7.963377 32.93830
## 4    american people    40            0      2 4.426593 23.45074
## 5          years ago    26            0      2 7.896667 23.26947
## 6 federal government    32            0      2 5.312744 21.80345

These are by default scored and sorted in order of descending score.

To "extract" them, just take the collocation column:

head(colls$collocation, 50)
##  [1] "united states"         "let us"                "fellow citizens"      
##  [4] "american people"       "years ago"             "federal government"   
##  [7] "almighty god"          "general government"    "fellow americans"     
## [10] "go forward"            "every citizen"         "chief justice"        
## [13] "four years"            "god bless"             "one another"          
## [16] "state governments"     "political parties"     "foreign nations"      
## [19] "solemn oath"           "public debt"           "religious liberty"    
## [22] "public money"          "domestic concerns"     "national life"        
## [25] "future generations"    "two centuries"         "social order"         
## [28] "passed away"           "good faith"            "move forward"         
## [31] "earnest desire"        "naval force"           "executive department" 
## [34] "best interests"        "human dignity"         "public expenditures"  
## [37] "public officers"       "domestic institutions" "tariff bill"          
## [40] "first time"            "race feeling"          "western hemisphere"   
## [43] "upon us"               "civil service"         "nuclear weapons"      
## [46] "foreign affairs"       "executive branch"      "may well"             
## [49] "state authorities"     "highest degree"
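If you then want these collocations treated as single features in a dfm, one possible follow-up (a sketch continuing from the toks and colls objects above, with an arbitrary z cutoff) is to compound them back into the tokens:

## compound significant collocations into single tokens, e.g. "united_states"
toks_comp <- tokens_compound(toks, pattern = colls[colls$z > 3, ])
dfmat <- dfm(toks_comp)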
Ken Benoit
  • Hi Ken, I'll certainly defer to your expertise on your package LOL, but I think he's trying to search for two specific bigrams across all his documents. **I think** – Chuck P Jun 05 '20 at 16:29
  • 1
    Hi Ken and Chuck, I’m currently still in early stages of familiarising myself with the data and how to specify certain things - so both your ways (if I can get them to work) will be really helpful for me, so thank you! I think initially Id want to do what Ken suggested but later on I may need to specify specific bigrams. I will try out both ways and see how I get on. Thank you for your help! – katwag97 Jun 05 '20 at 16:43

I think you need to create the ngrams directly from the corpus. This is an example adapted from the quanteda tutorial website:

library(quanteda)
corp <- corpus(data_corpus_inaugural)
toks <- tokens(corp)

tokens_ngrams(toks, n = 2)

Tokens consisting of 58 documents and 4 docvars.
1789-Washington :
 [1] "Fellow-Citizens_of" "of_the"             "the_Senate"         "Senate_and"         "and_of"             "of_the"             "the_House"         
 [8] "House_of"           "of_Representatives" "Representatives_:"  ":_Among"            "Among_the"         
[ ... and 1,524 more ]
Ahorn
  • Hi, thanks for answering! Is there a way to specify not to include stop words in this? – katwag97 Jun 05 '20 at 15:55
  • yes, like this: `toks_nostop <- tokens_select(toks, pattern = stopwords('en'), selection = 'remove')` – Ahorn Jun 05 '20 at 15:58
  • One note: if you want to remove tokens, you may want to use the option `padding = T` so that you can keep the empty slot for the removed tokens. Otherwise, two tokens with removed tokens in-between will be concatenated. e.g. "The president was right" -> "president_right" – amatsuo_net Jun 05 '20 at 16:01
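Putting those two comments together, a small sketch of the padded workflow on the inaugural corpus:

library(quanteda)

corp <- corpus(data_corpus_inaugural)
toks <- tokens(corp)

## remove stopwords but leave a pad ("") in each removed position
toks_nostop <- tokens_remove(toks, stopwords("en"), padding = TRUE)

## the pads stop bigrams from forming across removed stopwords
tokens_ngrams(toks_nostop, n = 2)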

EDITED: Hi, this example from the dfm help page may be useful:

library(quanteda)


# You say you're already creating the corpus?
# where it says "data_corpus_inaugural" put your corpus name

# where it says "the_senate" put "climate_change"
# where it says "the_house" put "global_warming"

tokens(data_corpus_inaugural) %>%
  tokens_ngrams(n = 2) %>%
  dfm(stem = TRUE, select = c("the_senate", "the_house"))

#> Document-feature matrix of: 58 documents, 2 features (89.7% sparse) and 4 docvars.
#>                  features
#> docs              the_senat the_hous
#>   1789-Washington         1        2
#>   1793-Washington         0        0
#>   1797-Adams              0        0
#>   1801-Jefferson          0        0
#>   1805-Jefferson          0        0
#>   1809-Madison            0        0
#> [ reached max_ndoc ... 52 more documents ]
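With the question's own data, the substituted call would look something like this (a sketch: corpus is your corpus object, ngram features are joined with "_", and stemming is left out so the bigram strings match exactly):

library(quanteda)

tokens(corpus) %>%
  tokens_ngrams(n = 2) %>%
  dfm(select = c("climate_change", "global_warming"))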
Chuck P
  • Hi, thanks for answering! I don't think this is achieving what I want it to, but I will play around with it, thanks! – katwag97 Jun 05 '20 at 15:54