Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or to segment them by sentence or paragraph units.

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data. quanteda includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.

Resources

CRAN page
Source code on GitHub (including the latest version in the dev branch)

627 questions

1 answer

R: How to count the total number of tokens in a corpus?

I have created a Quanteda corpus called readtext_corpus with 190 types of text. I would like to count the total number of tokens or words in the corpus. I tried the function ntoken which gives a number of words per text not the total number of words…

r nlp corpus quanteda

asked Feb 01 '22 at 00:13

cd3091

1 answer

Create keyword column with dictionary discarding longer matches

I am using tokens_lookup to see whether some texts contain the words in my dictionary discarding matches included in some pattern of words with nested_scope = "dictionary", as described in this answer. The idea is to discard longer dictionary…

r quanteda

asked Sep 16 '21 at 09:19

Jasper

1 answer

Measuring co-occurrence patterns in media articles over time with Quanteda

I am trying to measure the number of times that different words co-occur with a particular term in collections of Chinese newspaper articles from each quarter of a year. To do this, I have been using Quanteda and written several R functions to run…

r nlp quanteda

asked Aug 12 '21 at 20:35

Nick Olczak

1 answer

Transform Two Column Data Frame into Quanteda Dictionary Format

My ultimate goal is to create a quanteda dictionary to use for topic classification on text data. However, my topic keywords are stored in a somewhat different format: I have a column of about 4000 keywords and a second column that specifies the…

r dictionary transformation quanteda

asked Aug 12 '21 at 14:04

Julian

2 answers

How to transform a list of character vectors into a quanteda tokens object?

I have a list of character vectors that hold tokens for documents. list(doc1 = c("I", "like", "apples"), doc2 = c("You", "like", "apples", "too")) I would like to transform this vector into a quanteda tokens (or dfm) object in order to make use of…

r quanteda

asked Jul 18 '21 at 22:24

jhfodr76

1 answer

GGWordcloud with gradient color / transparent words (GGPlot Wordcloud gradient with adjustcolor)

I have created a wordcloud with ggwordcloud, because unfortunately I can't use alternative wordcloud packages. I was able to customize ggwordcloud to my requirements so far, only unfortunately I miss the implementation of a gradient that fades into…

r ggplot2 colors word-cloud quanteda

asked Jul 08 '21 at 07:04

Alex_

1 answer

Quanteda - creating a corpus from a dataframe with multiple documents

First question here, so apologies for any faux pas. I have a dataframe in R of 657 observations with 4 variables. Each observation is a speech or interview by the Australian Prime Minister. So the variables are: date, title, URL, txt (full text of…

r corpus quanteda

asked Apr 08 '21 at 07:04

Daniel Casey

1 answer

how to create interactions with quanteda?

Consider the following example library(quanteda) library(tidyverse) tibble(text = c('the dog is growing tall', 'the grass is growing as well')) %>% corpus() %>% dfm() Document-feature matrix of: 2 documents, 8 features (31.2%…

r quanteda

asked Mar 18 '21 at 14:03

ℕʘʘḆḽḘ

1 answer

tokens_compound() in quanteda changes the order of features

I found tokens_compound() in quanteda changes the order of tokens across different R sessions. That is, the result varies every time after restarting a session even if a seed value is fixed, though it does not change in a single session. Here is the…

r quanteda

asked Feb 18 '21 at 08:44

Shohei Doi

1 answer

Regex pattern to count lines in poems with randomly \n or \n\n as line breaks

I need to count the lines of 221 poems and tried counting the line breaks \n. However, some lines have double line breaks \n\n to make a new verse. These I only want counted as one. The amount and position of double line breaks is random in each…

r regex nlp data-science quanteda

asked Dec 15 '20 at 11:05

John

1 answer

Count certain letters in each document in a Quanteda corpus

Specifically, I need to count the frequencies of each vowel in each document: e and i as "high" vowels; a, o, and u as "low" vowels. Is there a way to count the frequencies of certain letters in each document in a quanteda corpus in R? So far, I…

r data-science quanteda

asked Dec 09 '20 at 09:28

John

1 answer

quanteda: dtm with new text and old vocabulary

I use quanteda to build a document term matrix: library(quanteda) mytext = "This is my old text" dtm <- dfm(mytext, tolower=T) convert(dtm,to="data.frame") Which yields: doc_id this is my old text 1 text1 1 1 1 1 1 I need to fit "new"…

r quanteda

asked Nov 11 '20 at 16:57

Peter

1 answer

R. Quanteda package. How to filter the values present in the dfm_tfidf?

So I have a dfm_tfidf and I want to filter out values that are below a certain threshold. Code: dfmat2 <- matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3), byrow = TRUE, nrow = 2, dimnames = list(docs = c("document1", "document2"), …

r quanteda

asked Sep 15 '20 at 13:14

Mig

2 answers

quanteda: remove tags (#, @) and URL in one string

Consider the following string: txt <- ("Viele Dank für das Feedback + die Verbesserungsvorschläge! :) http://testurl.com/5lhk5p #Greenwashing #PR #Vattenfal") I create a dfm (Create a document-feature matrix) and pre-process the string as…

r twitter corpus quanteda dfm

asked Sep 09 '20 at 13:51

Apache

3 answers

Keep only the text of a label

In a text which has formatting labels such as data.frame(id = c(1, 2), text = c("something here

my text

also

Keep it

", "

title

another here")) How can someone keep with a comma separate option only the text exist inside in…

r quanteda

asked Aug 04 '20 at 18:05

foc
