Questions tagged [quanteda]

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
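As a rough illustration of the corpus workflow described above, the following sketch builds a small corpus with one document-level variable, tokenizes it, removes stopwords, stems, and reshapes it into sentences. The texts and the `lang` variable are made up for the example, and the calls assume a recent quanteda release (version 3 or later).

```r
library(quanteda)

# Two toy documents; the `lang` docvar is illustrative, not from the tag wiki
txt <- c(doc1 = "The cats are sleeping. The dogs are barking loudly.",
         doc2 = "Le chat dort.")
corp <- corpus(txt, docvars = data.frame(lang = c("en", "fr")))

# Tokenize, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Remove English stopwords, then stem what is left
toks_nostop <- tokens_remove(toks, pattern = stopwords("en"))
toks_stem <- tokens_wordstem(toks_nostop, language = "english")

# Segment the corpus so that each sentence becomes its own document
corp_sent <- corpus_reshape(corp, to = "sentences")
ndoc(corp_sent)
```

Document-level variables travel with the texts, so `docvars(toks, "lang")` should still return the language of each document after tokenization.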

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data. quanteda includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
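A minimal sketch of dictionary-based feature extraction and of treating a collocation as a single feature might look like the following. The dictionary keys (`economy`, `welfare`), their patterns, and the example texts are hypothetical, chosen only to show the mechanics.

```r
library(quanteda)

txt <- c(d1 = "tax cuts and more tax cuts",
         d2 = "spending on public health")
toks <- tokens(txt)

# A hypothetical two-key dictionary; "tax*" is a glob pattern
dict <- dictionary(list(economy = c("tax*", "spending"),
                        welfare = "health"))

# Count dictionary matches per document instead of raw word types
dfmat_dict <- dfm_lookup(dfm(toks), dictionary = dict)
dfmat_dict

# Declare "public health" a collocation so it becomes one token
toks_comp <- tokens_compound(toks, pattern = phrase("public health"))
featnames(dfm(toks_comp))
```

With the default concatenator, the compounded token appears as `public_health` in the resulting feature names.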

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
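To make the dfm idea concrete, here is a small sketch, again with invented texts, that builds a document-feature matrix from tokens and inspects it. It covers only construction and simple description; scaling and classification methods are provided by separate functions and companion packages that are not shown here.

```r
library(quanteda)

txt <- c(d1 = "tax cuts and more tax cuts",
         d2 = "spending on public health")

# Documents in rows, features (here: word types) in columns
dfmat <- dfm(tokens(txt))

dim(dfmat)                  # number of documents x number of features
featnames(dfmat)            # the column labels
topfeatures(dfmat, n = 2)   # most frequent features overall

# Convert raw counts to within-document proportions
dfmat_prop <- dfm_weight(dfmat, scheme = "prop")

# A dense base-R matrix, useful for small examples only
mat <- as.matrix(dfmat)
```

A dfm is stored as a sparse matrix, so converting to a dense matrix with `as.matrix()` is only sensible for small inputs like this one.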

Resources

CRAN page
Source code on GitHub (including the latest version in the dev branch)

627 questions

2 answers

Generating all word unigrams through trigrams in R

I am trying to generate a list of all unigrams through trigrams in R to, eventually, make a document-phrase matrix with columns including all single words, bigrams, and trigrams. I expected to find an easy package for this, and have not succeeded. …

r text-processing tm rweka quanteda

asked Jul 08 '15 at 00:17

miratrix

1 answer

Replace quanteda tokens through regex

I would like to explicitly replace specific tokens defined in objects of class tokens of the package quanteda. I fail to replicate a standard approach that works well with stringr. The objective is to replace all tokens of the form "XXXof" in two…

r regex quanteda

asked Mar 16 '21 at 08:19

Francesco Grossetti

1 answer

Merge two dataframe by rows using common words

df1 <- data.frame(freetext = c("open until monday night", "one more time to insert your coin"), numid = c(291,312)) df2 <- data.frame(freetext = c("open until night", "one time to insert your be"), aid = c(3,5)) I would like to merge the two…

r quanteda

asked Jul 05 '20 at 10:44

foc

1 answer

How to initialize second glove model with solution from first?

I am trying to implement one of the solutions to the question "How to align two GloVe models in text2vec?". I don't understand what the proper input values are for GlobalVectors$new(..., init = list(w_i, w_j)). How do I ensure the values for…

r matrix nlp word2vec quanteda

asked Apr 10 '20 at 18:24

Ben

1 answer

How to remove stopwords in multiple languages?

I have a corpus with two languages (the language information is saved in the docvar lang) and want to remove stopwords depending on the docvar value. I am using a substantively nonsensical example to illustrate the point (since in the example…

r quanteda

asked Nov 01 '19 at 17:49

Ivo

0 answers

Argument ngrams not used

I use quanteda for text analysis. I use these commands: corp_df2 <- tokens(df$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_remove(pattern = stopwords(source = "smart"))…

r quanteda

asked Oct 06 '19 at 22:31

Nathalie

1 answer

quanteda: calculate text similarity by row between two DFMs

I have a data frame with 2 text fields: the comment and the main post. Basically, this is the structure: id comment post_text 1 "I think that blabla.." "Why is blabla.." 2 "Well, you should…

r nlp similarity quanteda

asked Apr 11 '19 at 10:06

Carbo

1 answer

How can I bootstrap text readability statistics using quanteda?

I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily obtain readability scores for each group with the…

r nlp tm quanteda statistics-bootstrap

asked Mar 14 '19 at 19:08

beddotcom

1 answer

How to export a dictionary in LIWC dictionary format using R quanteda

In quanteda one can import LIWC format dictionaries. But is there a way to export a dictionary from quanteda to LIWC format? A sample of the dictionary format for LIWC is below (the part between the % is the name of each category): % 462 Asentir 463…

r dictionary quanteda

asked Jul 17 '18 at 00:25

useR

2 answers

R Spanish Term Frequency Matrix with TD and Quanteda Spanish Characters

I am trying to learn how to do some text analysis with Twitter data. I am running into an issue when creating a Term Frequency Matrix. I create the Corpus out of Spanish text (with special characters), with no issues. However, when I create the…

r special-characters encode quanteda

asked Apr 26 '18 at 02:21

Beep

1 answer

Using half-space in R package quanteda

I am using the kwic function in the quanteda package in R to look up some phrases in Kurdish. In Kurdish, some compound words and phrases are separated by a half-space. When I use a phrase including a half-space, R considers it a typo (the red dot) and…

r quanteda

asked Apr 22 '18 at 23:56

Ali

2 answers

Logical combinations in quanteda dictionaries

I'm using the quanteda dictionary lookup. I'm trying to formulate entries where I can look up logical combinations of words. For example: Teddybear = (fluffy AND adorable AND soft). Is this possible? So far I have only found a solution to test for phrases…

r quanteda

asked Apr 17 '18 at 08:05

Andreas

2 answers

quanteda kwic to extract number followed by percentage

I have some text with phrases containing numbers, followed by a number of symbols. I want to extract them, for example, numbers followed by percentages. Using the kwic function from the quanteda package seems to work for numbers as regular expressions…

r regex quanteda

asked Apr 11 '18 at 00:26

panchtox

2 answers

removing special apostrophes from French article contractions when tokenizing

I am currently running an stm (structural topic model) of a series of articles from the French newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text. I'm currently using the quanteda package…

r character gsub topic-modeling quanteda

asked Mar 01 '18 at 23:25

kouta

1 answer

Term document entropy calculation

Using dtm it is possible to take the term frequency. How is it possible or is there any easy way to calculate the entropy? It is giving higher weight to the terms with less frequency in some documents. entropy = 1 + (Σj pij log2(pij)/log2n pij =…

r term-document-matrix quanteda

asked Feb 10 '18 at 19:11

Airi
