Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or to segment them by sentence or paragraph units.

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data. quanteda includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
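The workflow described above (corpus, then tokens, then dfm) can be sketched roughly as follows. This is a minimal illustration, not taken from the package documentation: the two sample texts, the `year` document variable, and the object names `txts`, `corp`, `toks`, and `dfmat` are made up for the example, and it assumes a reasonably recent quanteda release in which `tokens_remove()` and `stopwords()` are available.

```r
library(quanteda)

# Two made-up texts; the names become the document names
txts <- c(doc1 = "The quick brown fox.",
          doc2 = "The lazy dog sleeps.")

# Build a corpus and attach a hypothetical document-level variable
corp <- corpus(txts, docvars = data.frame(year = c(2018, 2019)))

# Tokenize, dropping punctuation, then remove English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# Convert the tokens into a document-feature matrix (lowercased by default)
dfmat <- dfm(toks)

print(dfmat)
```

The resulting `dfmat` has one row per document and one column per remaining feature, and is the object that the descriptive, scaling, and classification tools mentioned above would typically take as input.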

Resources

CRAN page
Source code on GitHub (including the latest version in the dev branch)

627 questions

1 answer

Remove words from a dtm

I have created a dtm. library(tm) corpus = Corpus(VectorSource(dat$Reviews)) dtm = DocumentTermMatrix(corpus) I used it to remove rare terms. dtm = removeSparseTerms(dtm, 0.98) After removeSparseTerms there are still some terms in the dtm which…

r text tm quanteda

asked Apr 24 '19 at 14:45

Banjo

1 answer

Is there an R function for finding keywords within a certain 'word distance'?

What I need is a function to find words within a certain 'word distance'. The words 'bag' and 'tool' are interesting in a sentence "He had a bag of tools in his car." With the Quanteda kwic function I can find 'bag' and 'tool' individually, but…

r quanteda

asked Apr 04 '19 at 06:39

Willem Gooijaers

1 answer

Create custom dictionary from character vector

I am trying to look for specific words in a corpus using dfm_lookup(). I am really struggling with the dictionaries needed for dfm_lookup(). I created a character vector named "words" which contains all the words that should go into the…

r text-mining quanteda

asked Mar 18 '19 at 09:24

BanffBoss122

2 answers

how to use quanteda on aggregated data?

Consider this example tibble(text = c('a grande latte with soy milk', 'black coffee no room'), repetition = c(100, 2)) # A tibble: 2 x 2 text repetition 1 a…

r quanteda

asked Feb 15 '19 at 15:34

ℕʘʘḆḽḘ

3 answers

How to remove single and double char tokens using quanteda::tokens_select()

I am trying to remove single and double char tokens. Here is an example: toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE) toks <- tokens_select(toks, min_nchar=1L, max_nchar=2L, selection =…

r quanteda

asked Feb 09 '19 at 17:07

ronencozen

1 answer

substituting several ngrams in quanteda

In my text of news articles I would like to convert several different ngrams that refer to the same political party to an acronym. I would like to do this because I would like to avoid any sentiment dictionaries confusing the words in the party's…

r text-mining quanteda

asked Oct 05 '18 at 14:12

spindoctor

0 answers

textmodel wordfish in quanteda

I am trying to understand the logic of ideological scaling. I have a dataset consisting of monetary and fiscal policy related texts with a dimension t (=time), and j (institution). I would like to scale the texts using wordfish. example of quanteda…

quanteda

asked Aug 06 '18 at 09:32

Ulrich Fritsche

0 answers

Beginner advice about adding start/end sentence markers: using Quanteda functionalities versus doing it manually (custom code)

I need to add begin and end sentence markers to some texts that I analyze using Quanteda. I would like to add these markers using Quanteda but I do not see an explicit way to do that "out of the box". Searching for an answer I found a different…

regex nlp quanteda text2vec

asked Aug 01 '18 at 07:18

user778806

1 answer

How to combine corpus documents

The example below is a list of 14 texts within a corpus. The corpus consists of 14 documents. I am trying to find a way to combine all the texts into one document. Then, the corpus would consist of 1 document rather than 14.

quanteda

asked Jun 23 '18 at 13:32

Nicholas Bradley

2 answers

Keep only sentences in corpus that contain specific key words (in R)

I have a corpus with .txt documents. From these .txt documents, I do not need all sentences, but I only want to keep certain sentences that contain specific key words. From there on, I will perform similarity measures etc. So, here is an…

r nlp text-mining corpus quanteda

asked Jun 13 '18 at 15:55

vewees

1 answer

R: problems applying LIME to quanteda text model

It's a modified version of my previous question: I'm trying to run LIME on my quanteda text model that feeds off Trump & Clinton tweets data. I run it following an example given by Thomas Pedersen in his Understanding LIME and useful SO answer…

r text text-classification quanteda lime

asked May 11 '18 at 10:57

Kasia Kulma

0 answers

Lots of empty whitespace in textplot_wordcloud / comparison.cloud

I have a shiny app that plots a wordcloud of terms and trying to get it to fit with labels inside the app is difficult. If I expand the screen everything fits but if I don't labels get cut off and it's because of all that extra whitespace that is…

r shiny word-cloud quanteda

asked Feb 15 '18 at 16:12

Ted Mosby

4 answers

A lemmatizing function using a hash dictionary does not work with tm package in R

I would like to lemmatize Polish text using a large external dictionary (format like in txt variable below). Unfortunately, the popular text mining packages do not offer a Polish option. The answer https://stackoverflow.com/a/45790325/3480717 by…

r text-mining tm quanteda text2vec

asked Sep 08 '17 at 18:30

Jacek Kotowski

1 answer

Can the ANEW dictionary be used for sentiment analysis in quanteda?

I am trying to find a way to implement the Affective Norms for English Words (in Dutch) for a longitudinal sentiment analysis with Quanteda. What I ultimately want to have is a "mean sentiment" per year in order to show any longitudinal trends. In…

r nlp sentiment-analysis quanteda

asked May 23 '17 at 10:32

Daniel Hansen

2 answers

How to calculate proximity of words to a specific term in a document

I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me…

r tm quanteda

asked May 18 '17 at 20:57

DHranger
