Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. It includes tools to manipulate the texts in a corpus quickly and easily, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics using non-parametric bootstrapping applied to the original texts as data. It also includes a suite of tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
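As a rough illustration of dictionary-based features and collocations, the sketch below compounds a two-word phrase into one token and then counts dictionary matches. The sentence, the dictionary keys (`economy`, `country`), and their entries are made up for illustration; only the quanteda functions themselves (`tokens()`, `tokens_compound()`, `phrase()`, `dictionary()`, `tokens_lookup()`, `dfm()`) come from the package.

```r
library(quanteda)

# Hypothetical example sentence.
toks <- tokens("The United States signed a trade agreement on taxes and tariffs.",
               remove_punct = TRUE)

# Treat the collocation "United States" as a single feature ("United_States").
toks <- tokens_compound(toks, pattern = phrase("United States"))

# Hypothetical two-key dictionary; glob patterns such as "tax*" are the default.
dict <- dictionary(list(economy = c("trade", "tax*", "tariff*"),
                        country = "united_states"))

# Replace matched tokens by their dictionary key, then count keys per document.
dfmat_dict <- dfm(tokens_lookup(toks, dictionary = dict))
print(dfmat_dict)
```

Pattern matching in `tokens_lookup()` is case-insensitive by default, which is why the lower-case entry `united_states` can match the compounded token.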

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
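A minimal sketch of the workflow described above, going from a corpus to tokens to a dfm. The two toy documents and the `source` document variable are invented for illustration, and the choice to remove English stopwords and to stem is just one possible preprocessing setup.

```r
library(quanteda)

# Two toy documents with one document-level variable (hypothetical data).
txt <- c(doc1 = "The quick brown fox jumps over the lazy dog.",
         doc2 = "A lazy dog sleeps. The fox runs away!")
corp <- corpus(txt, docvars = data.frame(source = c("a", "b")))

# Tokenize, drop punctuation, remove English stopwords, then stem.
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, pattern = stopwords("en"))
toks <- tokens_wordstem(toks)

# Build the document-feature matrix (features are lower-cased by default).
dfmat <- dfm(toks)

print(dfmat)
print(featnames(dfmat))
```

From here the dfm can be passed to descriptive or modelling functions; which ones are appropriate depends on the analysis.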


627 questions
4
votes
2 answers

Generating all word unigrams through trigrams in R

I am trying to generate a list of all unigrams through trigrams in R to, eventually, make a document-phrase matrix with columns including all single words, bigrams, and trigrams. I expected to find an easy package for this, and have not succeeded. …
miratrix
  • 191
  • 2
  • 12
3
votes
1 answer

Replace quanteda tokens through regex

I would like to explicitly replace specific tokens defined in objects of class tokens of the package quanteda. I fail to replicate a standard approach that works well with stringr. The objective is to replace all tokens of the form "XXXof" in two…
Francesco Grossetti
  • 1,555
  • 9
  • 17
3
votes
1 answer

Merge two dataframe by rows using common words

df1 <- data.frame(freetext = c("open until monday night", "one more time to insert your coin"), numid = c(291,312)) df2 <- data.frame(freetext = c("open until night", "one time to insert your be"), aid = c(3,5)) I would like to merge the two…
foc
  • 947
  • 1
  • 9
  • 26
3
votes
1 answer

How to initialize second glove model with solution from first?

I am trying to implement one of the solutions to the question "How to align two GloVe models in text2vec?". I don't understand what the proper input values are for GlobalVectors$new(..., init = list(w_i, w_j)). How do I ensure the values for…
Ben
  • 41,615
  • 18
  • 132
  • 227
3
votes
1 answer

How to remove stopwords in multiple languages?

I have a corpus with two languages (the language information is saved in the docvar lang) and want to remove stopwords depending on the docvar value. I am using a substantively nonsensical example to illustrate the point (since in the example…
Ivo
  • 3,890
  • 5
  • 22
  • 53
3
votes
0 answers

Argument ngrams not used

I use quanteda for text analysis. I use these commands: corp_df2 <- tokens(df$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_remove(pattern = stopwords(source = "smart"))…
Nathalie
  • 1,228
  • 7
  • 20
3
votes
1 answer

quanteda: calculate text similarity by row between two DFMs

I have a data frame with 2 text fields: the comment and the main post. Basically, this is the structure: id comment post_text 1 "I think that blabla.." "Why is blabla.." 2 "Well, you should…
Carbo
  • 906
  • 5
  • 23
3
votes
1 answer

How can I bootstrap text readability statistics using quanteda?

I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily obtain readability scores for each group with the…
beddotcom
  • 447
  • 3
  • 11
3
votes
1 answer

How to export a dictionary in LIWC dictionary format using R quanteda

In quanteda one can import LIWC format dictionaries. But is there a way to export a dictionary from quanteda to LIWC format? A sample of the dictionary format for LIWC is below (the part between the % is the name of each category): % 462 Asentir 463…
useR
  • 179
  • 3
  • 13
3
votes
2 answers

R Spanish Term Frequency Matrix with TD and Quanteda Spanish Characters

I am trying to learn how to do some text analysis with Twitter data. I am running into an issue when creating a Term Frequency Matrix. I create the Corpus out of Spanish text (with special characters), with no issues. However, when I create the…
Beep
  • 33
  • 4
3
votes
1 answer

Using half-space in R package quanteda

I am using the KWIC function in the quanteda package in R to look up some phrases in Kurdish. In Kurdish, some compound words and phrases are separated by a half-space. When I use a phrase including a half-space, R considers it a typo (the red dot) and…
Ali
  • 45
  • 4
3
votes
2 answers

Logical combinations in quanteda dictionaries

I'm using the quanteda dictionary lookup. I'm trying to formulate entries where I can look up logical combinations of words. For example: Teddybear = (fluffy AND adorable AND soft) Is this possible? So far I have only found a solution to test for phrases…
Andreas
  • 707
  • 1
  • 6
  • 23
3
votes
2 answers

quanteda kwic to extract number followed by percentage

I have some text with phrases containing numbers, followed by a number of symbols. I want to extract them, for example, numbers followed by percentages. Using kwic function from quanteda package seems to work for numbers as regular expressions…
panchtox
  • 634
  • 7
  • 16
3
votes
2 answers

removing special apostrophes from French article contractions when tokenizing

I am currently running an stm (structural topic model) on a series of articles from the French newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text. I'm currently using the quanteda package…
kouta
  • 55
  • 6
3
votes
1 answer

Term document entropy calculation

Using a dtm it is possible to get the term frequency. Is there an easy way to calculate the entropy? It gives higher weight to the terms with less frequency in some documents. entropy = 1 + (Σj pij log2(pij)) / log2(n), pij =…
Airi
  • 43
  • 5