Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or to segment them by sentence or paragraph units.

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data. quanteda includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.

627 questions
2
votes
1 answer

Remove words from a dtm

I have created a dtm: `library(tm)` `corpus = Corpus(VectorSource(dat$Reviews))` `dtm = DocumentTermMatrix(corpus)`. I used it to remove rare terms: `dtm = removeSparseTerms(dtm, 0.98)`. After `removeSparseTerms` there are still some terms in the dtm which…
Banjo
2
votes
1 answer

Is there an R function for finding keywords within a certain 'word distance'?

What I need is a function to find words within a certain 'word distance'. The words 'bag' and 'tool' are interesting in a sentence "He had a bag of tools in his car." With the Quanteda kwic function I can find 'bag' and 'tool' individually, but…
2
votes
1 answer

Create custom dictionary from character vector

I am trying to look for specific words in a corpus using dfm_lookup(). I am really struggling with the dictionaries needed for dfm_lookup(). I created a character vector named "words" which contains all the words that should go into the…
BanffBoss122
2
votes
2 answers

how to use quanteda on aggregated data?

Consider this example: `tibble(text = c('a grande latte with soy milk', 'black coffee no room'), repetition = c(100, 2))` # A tibble: 2 x 2 text repetition 1 a…
ℕʘʘḆḽḘ
2
votes
3 answers

How to remove single and double char tokens using quanteda::tokens_select()

I am trying to remove single and double char tokens. Here is an example: `toks <- tokens(c("This is a sentence. This is a second sentence."), remove_punct = TRUE)` `toks <- tokens_select(toks, min_nchar=1L, max_nchar=2L, selection =`…
ronencozen
2
votes
1 answer

substituting several ngrams in quanteda

In my text of news articles I would like to convert several different ngrams that refer to the same political party to an acronym. I would like to do this because I would like to avoid any sentiment dictionaries confusing the words in the party's…
spindoctor
2
votes
0 answers

textmodel wordfish in quanteda

I am trying to understand the logic of ideological scaling. I have a dataset consisting of monetary and fiscal policy related texts with a dimension t (=time), and j (institution). I would like to scale the texts using wordfish. example of quanteda…
2
votes
0 answers

Beginner advice about adding start/end sentence markers: using Quanteda functionalities versus doing it manually (custom code)

I need to add begin and end sentence markers to some texts that I analyze using Quanteda. I would like to add these markers using Quanteda but I do not see an explicit way to do that "out of the box". Searching for an answer I found a different…
user778806
2
votes
1 answer

How to combine corpus documents

The example below is a list of 14 texts within a corpus. The corpus consists of 14 documents. I am trying to find a way to combine all the texts into one document. Then, the corpus would consist of 1 document rather than 14.
2
votes
2 answers

Keep only sentences in corpus that contain specific key words (in R)

I have a corpus with .txt documents. From these .txt documents, I do not need all sentences, but I only want to keep certain sentences that contain specific key words. From there on, I will perform similarity measures etc. So, here is an…
vewees
2
votes
1 answer

R: problems applying LIME to quanteda text model

It's a modified version of my previous question: I'm trying to run LIME on my quanteda text model that feeds off Trump & Clinton tweets data. I run it following an example given by Thomas Pedersen in his Understanding LIME and useful SO answer…
Kasia Kulma
2
votes
0 answers

Lots of empty whitespace in textplot_wordcloud / comparison.cloud

I have a shiny app that plots a wordcloud of terms and trying to get it to fit with labels inside the app is difficult. If I expand the screen everything fits but if I don't labels get cut off and it's because of all that extra whitespace that is…
Ted Mosby
2
votes
4 answers

A lemmatizing function using a hash dictionary does not work with tm package in R

I would like to lemmatize Polish text using a large external dictionary (format like in txt variable below). I am not lucky enough to have a Polish option in popular text mining packages. The answer https://stackoverflow.com/a/45790325/3480717 by…
Jacek Kotowski
2
votes
1 answer

Can the ANEW dictionary be used for sentiment analysis in quanteda?

I am trying to find a way to implement the Affective Norms for English Words (in Dutch) for a longitudinal sentiment analysis with Quanteda. What I ultimately want to have is a "mean sentiment" per year in order to show any longitudinal trends. In…
Daniel Hansen
2
votes
2 answers

How to calculate proximity of words to a specific term in a document

I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me…
DHranger