Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
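As a rough illustration of the corpus workflow described above, the following sketch builds a small corpus with one document-level variable, tokenizes it, removes stopwords, stems the tokens, and re-segments the corpus by sentence. The texts and the `party` variable are made up for the example, and the preprocessing choices are only one reasonable configuration.

```r
library(quanteda)

# Hypothetical two-document example; texts and the "party" docvar are invented
txts <- c(doc1 = "The dog is growing tall. The grass is growing as well.",
          doc2 = "You like apples. I like apples too.")
corp <- corpus(txts, docvars = data.frame(party = c("A", "B")))

# Tokenize, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Remove English stopwords, then stem the remaining tokens
toks_clean <- tokens_wordstem(tokens_remove(toks, stopwords("en")))

# Re-segment the corpus so that each sentence becomes its own document
corp_sent <- corpus_reshape(corp, to = "sentences")

ndoc(corp)       # number of original documents
ndoc(corp_sent)  # number of sentence-level documents
```

Document-level variables such as `party` can be read back with `docvars(corp)`.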

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics using non-parametric bootstrapping techniques applied to the original texts as data. quanteda also includes a suite of tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
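The dictionary and collocation features mentioned above might be used along these lines. This is a minimal sketch: the texts, the dictionary keys (`economy`, `usa`), and their patterns are hypothetical and chosen only for illustration.

```r
library(quanteda)

# Invented example texts
txt <- c(d1 = "The United States and the European Union signed a trade deal.",
         d2 = "Taxes and trade dominated the debate in the United States.")
toks <- tokens(txt, remove_punct = TRUE)

# Declare a collocation so that it is treated as a single feature
toks_comp <- tokens_compound(toks, phrase("United States"))
dfmat_comp <- dfm(toks_comp)
featnames(dfmat_comp)  # includes the compounded feature "united_states"

# Hypothetical dictionary; values use glob patterns and may span several words
dict <- dictionary(list(economy = c("trade", "tax*"),
                        usa = "united states"))

# Replace tokens by their dictionary keys and count matches per document
dfmat_dict <- dfm(tokens_lookup(toks, dict))
dfmat_dict
```

Here the multi-word dictionary value is matched against the uncompounded tokens, while `tokens_compound()` shows the alternative of joining a collocation into one token before building the matrix.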

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
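A small sketch of what working with a dfm can look like, using only functions from the core package. The two texts are invented, and the weighting schemes shown are just examples of the descriptive steps one might take before any scaling or classification.

```r
library(quanteda)

# Invented example texts
txt <- c(d1 = "apples and oranges and apples",
         d2 = "oranges and bananas")

# Build the document-feature matrix from tokens
dfmat <- dfm(tokens(txt))

# Most frequent feature across all documents
topfeatures(dfmat, 1)

# Relative frequencies within each document
dfmat_prop <- dfm_weight(dfmat, scheme = "prop")

# tf-idf weighting, which down-weights features found in every document
dfmat_tfidf <- dfm_tfidf(dfmat)

# Convert to a plain data frame for use with other modelling tools
df <- convert(dfmat, to = "data.frame")
```

The resulting data frame has one row per document, with a `doc_id` column followed by one column per feature.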

627 questions
2 votes, 1 answer

R: How to count the total number of tokens in a corpus?

I have created a Quanteda corpus called readtext_corpus with 190 types of text. I would like to count the total number of tokens or words in the corpus. I tried the function ntoken which gives a number of words per text not the total number of words…
cd3091
2 votes, 1 answer

Create keyword column with dictionary discarding longer matches

I am using tokens_lookup to see whether some texts contain the words in my dictionary discarding matches included in some pattern of words with nested_scope = "dictionary", as described in this answer. The idea is to discard longer dictionary…
Jasper
2 votes, 1 answer

Measuring co-occurence patterns in media articles over time with Quanteda

I am trying to measure the number of times that different words co-occur with a particular term in collections of Chinese newspaper articles from each quarter of a year. To do this, I have been using Quanteda and written several R functions to run…
Nick Olczak
2 votes, 1 answer

Transform Two Column Data Frame into Quanteda Dictionary Format

My ultimate goal is to create a quanteda dictionary to use for topic classification on text data. However, my topic keywords are stored in a somewhat different format: I have a column of about 4000 keywords and a second column that specifies the…
Julian
2 votes, 2 answers

How to transform a list of character vectors into a quanteda tokens object?

I have a list of character vectors that hold tokens for documents. list(doc1 = c("I", "like", "apples"), doc2 = c("You", "like", "apples", "too")) I would like to transform this vector into a quanteda tokens (or dfm) object in order to make use of…
jhfodr76
2 votes, 1 answer

GGWordcloud with gradient color / transparent words (GGPlot Wordcloud gradient with adjustcolor)

I have created a wordcloud with ggwordcloud, because unfortunately I can't use alternative wordcloud packages. I was able to customize ggwordcloud to my requirements so far, only unfortunately I miss the implementation of a gradient that fades into…
Alex_
2 votes, 1 answer

Quanteda - creating a corpus from a dataframe with multiple documents

First question here, so apologies for any faux pas. I have a dataframe in R of 657 observations with 4 variables. Each observation is a speech or interview by the Australian Prime Minister. So the variables are: date title URL txt (full text of…
2 votes, 1 answer

how to create interactions with quanteda?

Consider the following example library(quanteda) library(tidyverse) tibble(text = c('the dog is growing tall', 'the grass is growing as well')) %>% corpus() %>% dfm() Document-feature matrix of: 2 documents, 8 features (31.2%…
ℕʘʘḆḽḘ
2 votes, 1 answer

tokens_compound() in quanteda changes the order of features

I found tokens_compound() in quanteda changes the order of tokens across different R sessions. That is, the result varies every time after restarting a session even if a seed value is fixed, though it does not change in a single session. Here is the…
Shohei Doi
2 votes, 1 answer

Regex pattern to count lines in poems with randomly \n or \n\n as line breaks

I need to count the lines of 221 poems and tried counting the line breaks \n. However, some lines have double line breaks \n\n to make a new verse. These I only want counted as one. The amount and position of double line breaks is random in each…
John
2 votes, 1 answer

Count certain letters in each document in a Quanteda corpus

Specifically, I need to count the frequencies of each vowel in each document: e and i as "high" vowels; a, o, and u as "low" vowels. Is there a way to count the frequencies of certain letters in each document in a quanteda corpus in R? So far, I…
John
2 votes, 1 answer

quanteda: dtm with new text and old vocabulary

I use quanteda to build a document term matrix: library(quanteda) mytext = "This is my old text" dtm <- dfm(mytext, tolower=T) convert(dtm,to="data.frame") Which yields: doc_id this is my old text 1 text1 1 1 1 1 1 I need to fit "new"…
Peter
2 votes, 1 answer

R. Quanteda package. How to filter the values present in the dfm_tfidf?

So I have a dfm_tfidf and I want to filter out values that are below a certain threshold. Code: dfmat2 <- matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3), byrow = TRUE, nrow = 2, dimnames = list(docs = c("document1", "document2"), …
Mig
2 votes, 2 answers

quanteda: remove tags (#,@) and url in on string

Consider the following string: txt <- ("Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal") I create a dfm (Create a document-feature matrix) and pre-process the string as…
Apache
2 votes, 3 answers

Keep only the text of a label

In a text which has formatting labels such as data.frame(id = c(1, 2), text = c("something here

my text

also

Keep it

", "

title

another here")) How can someone keep with a comma separate option only the text exist inside in…
foc