Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.


The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
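A minimal sketch of that workflow, using the package's built-in data_char_ukimmig2010 texts (any character vector or corpus would work the same way):

    library(quanteda)

    corp <- corpus(data_char_ukimmig2010)               # build a corpus from a character vector
    toks <- tokens(corp, remove_punct = TRUE)           # tokenize
    toks <- tokens_remove(toks, stopwords("english"))   # drop stopwords
    toks <- tokens_wordstem(toks)                       # stem
    sents <- corpus_reshape(corp, to = "sentences")     # re-segment the corpus into sentence units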

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics by applying non-parametric bootstrapping to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
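A sketch of the dictionary and collocation side, with a small invented dictionary (the keys and patterns are illustrative only):

    # declare a collocation to be treated as a single feature
    toks <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
    toks <- tokens_compound(toks, pattern = phrase("european union"))

    # hypothetical dictionary; tokens_lookup() maps matching tokens onto the dictionary keys
    dict <- dictionary(list(economy = c("econom*", "tax*"),
                            europe  = c("european_union", "eu")))
    dfmat_dict <- dfm(tokens_lookup(toks, dictionary = dict))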

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
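A sketch of the dfm stage (note that textstat_simil() ships with quanteda in older releases and with quanteda.textstats from version 3 on):

    dfmat <- dfm(tokens(data_char_ukimmig2010, remove_punct = TRUE))
    dfmat <- dfm_remove(dfmat, stopwords("english"))
    topfeatures(dfmat, 10)                    # describe: most frequent features
    # library(quanteda.textstats) is needed for the next call in quanteda >= 3
    textstat_simil(dfmat, method = "cosine")  # compare: document-by-document cosine similarity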


627 questions
2 votes, 1 answer

Keyword in context (kwic) for skipgrams?

I do keyword in context analysis with quanteda for ngrams and tokens and it works well. I now want to do it for skipgrams, to capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code creates a kwic object which is…
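For the skipgram question above, one partial workaround (the example sentence is invented): inside phrase(), the glob "*" matches exactly one token, so short gaps can be listed explicitly:

    library(quanteda)
    toks <- tokens("High barriers to entry and barriers to early entry were discussed.")
    kwic(toks,
         pattern = phrase(c("barriers to entry", "barriers to * entry")),
         window = 3)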
2 votes, 1 answer

Identify WHICH words in a document have been matched by dictionary lookup and how many times

Quanteda question. For each document in a corpus, I am trying to find out which of the words in a dictionary category contribute to the overall counts for that category, and how much. Put differently, I want to get a matrix of the features in each…
jules
  • 31
  • 3
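For the dictionary question above, a sketch using dfm_select(), which keeps only the features matching a dictionary, so the result shows which words contribute to the counts per document (subset the dictionary, e.g. dict["economy"], to inspect one key at a time); the dictionary here is invented:

    library(quanteda)
    dict <- dictionary(list(economy = c("tax*", "trade"), conflict = c("war", "defen*")))
    dfmat <- dfm(tokens(data_corpus_inaugural[1:5]))
    dfm_select(dfmat, pattern = dict)   # counts of the matching words, per document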
2 votes, 3 answers

how to extract ngrams from a text in R (newspaper articles)

I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm: dfmatrix <- dfm(corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE, remove_numbers = FALSE) I am trying to…
katwag97
  • 29
  • 3
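For the n-gram question above, a sketch that moves the preprocessing to the tokens stage (current quanteda style) and then builds bigrams; 'corpus' is the asker's object:

    library(quanteda)
    toks <- tokens(corpus, remove_punct = TRUE, remove_numbers = FALSE)
    toks <- tokens_remove(toks, stopwords("english"))
    toks <- tokens_wordstem(toks)
    dfmatrix_bigrams <- dfm(tokens_ngrams(toks, n = 2))   # bigram document-feature matrix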
2 votes, 1 answer

Quanteda: How to look up patterns of two or more words in a phrase, when there can be any number of words in between?

I want to match some patterns in a text in R using the package {quanteda} and the tokens_lookup() function with the default valuetype="glob". The pattern would be the occurrence of one word in connection with a second word located anywhere in the…
Dr. Fabian Habersack
  • 1,111
  • 12
  • 30
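For the pattern question above, a partial workaround under the assumption that a bounded gap is acceptable: in glob matching, "*" stands for exactly one token, so each allowed gap length is listed as its own pattern (the word pair and sentence are invented):

    library(quanteda)
    toks <- tokens("The climate summit produced a new climate and energy policy.")
    dict <- dictionary(list(climate_policy = c("climate policy",
                                               "climate * policy",
                                               "climate * * policy")))
    tokens_lookup(toks, dictionary = dict, valuetype = "glob")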
2 votes, 1 answer

Approximate string matching in R between two datasets

I have the following dataset containing film titles and the corresponding genre, while another dataset contains plain text where these titles might be quoted or not: dt1 title genre Secret in Their Eyes…
Carbo
  • 906
  • 5
  • 23
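For the title-matching question above, a rough sketch using base R's fuzzy matcher agrepl() rather than quanteda (the column names dt1$title and dt2$text are assumptions based on the excerpt):

    # TRUE where a title appears, approximately, in a text
    hits <- sapply(dt1$title, function(ttl)
      agrepl(ttl, dt2$text, max.distance = 0.1, ignore.case = TRUE))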
2 votes, 1 answer

Why does textstat_simil() with method "cosine" return NA

I am computing cosine similarity over two dfm objects. One is my reference object which has dimensions 5 x 4,728 while the second dfm is my target object and has dimensions 2,325,329 x 40,595. What I don't understand is why textstat_simil() returns…
Francesco Grossetti
  • 1,555
  • 9
  • 17
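For the cosine question above, a likely cause is all-zero document vectors (the cosine denominator is then a zero norm); a sketch that drops empty documents before computing the similarity ('dfmat' is a placeholder for either dfm):

    dfmat <- dfm_subset(dfmat, ntoken(dfmat) > 0)   # drop documents with no remaining features
    textstat_simil(dfmat, method = "cosine")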
2 votes, 1 answer

Searching for advanced regex patterns with kwic()

I want to use kwic() to find patterns in text with more advanced regex phrases, but I am struggling with the way kwic() is tokenising phrases and two problems arose: 1) How to use grouping with phrases that contain whitespace: kwic(text, pattern…
Sherls
  • 31
  • 3
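For the regex question above, a sketch of one relevant detail: quanteda applies a regex to one token at a time, so a multi-word pattern has to be split with phrase(), one regex per token (the example text is invented):

    library(quanteda)
    toks <- tokens("The new tax policy replaced the old tax code.")
    kwic(toks, pattern = phrase("ne\\w+ tax"), valuetype = "regex", window = 3)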
2 votes, 2 answers

Select phrases found in dictionary and return dataframe of doc_id and phrase

I have a dictionary file of medical phrases and a corpus of raw texts. I'm trying to use the dictionary file to select the relevant phrases from the text. Phrases, in this case, are 1 to 5-word n-grams. In the end, I would like the selected phrases…
Obed
  • 403
  • 3
  • 12
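For the phrase-extraction question above, a sketch built on the fact that kwic() accepts a dictionary as its pattern, and a kwic object converts to a data frame with the document name and the matched phrase ('toks' and 'dict' stand for the asker's tokens and dictionary):

    kw <- kwic(toks, pattern = dict)
    as.data.frame(kw)[, c("docname", "keyword")]   # doc_id and matched phrase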
2 votes, 1 answer

How to convert DFM into dataframe BUT keeping docvars?

I am using the quanteda package and the very good tutorials that have been written about it to perform various operations on newspaper articles. I obtained the frequency of specific words over time by selecting them in a mainwordsDFM and using…
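For the conversion question above, a sketch using convert() and then re-attaching the document variables ('mainwordsDFM' is the asker's object):

    df <- convert(mainwordsDFM, to = "data.frame")   # one row per document, one column per feature
    df <- cbind(df, docvars(mainwordsDFM))           # add the docvars back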
2 votes, 2 answers

Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions

I am using a dictionary to identify usage of a particular set of words in a corpus. I have included multi-word patterns in the dictionary, however, I don't think dfm_lookup (from the quanteda package) matches multi-word expressions. Does anyone know…
MCC89
  • 57
  • 3
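For the lookup question above: once a dfm is built, each feature is a single word, so multi-word dictionary values can no longer be matched; a sketch that applies the dictionary at the tokens stage instead ('toks' and 'dict' stand for the asker's objects):

    dfmat <- dfm(tokens_lookup(toks, dictionary = dict))   # phrases are matched before the dfm is formed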
2 votes, 1 answer

Seeing metadata/docvars associated with STM topics

I am new to text analysis and am stuck on a question that doesn't seem to be answered in the documentation (or at least, I can't find it). I have created an STM in R from a Quanteda DfM which has docvars associated to it. The topics are based on…
Marina W.
  • 91
  • 2
  • 10
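For the STM question above, a sketch of one route: convert() with to = "stm" carries the docvars along as the $meta element, so they stay attached to the model input ('dfmat' stands for the asker's dfm; K = 10 is arbitrary):

    library(stm)
    stm_input <- convert(dfmat, to = "stm")
    mod <- stm(stm_input$documents, stm_input$vocab, K = 10, data = stm_input$meta)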
2 votes, 1 answer

textstat_keyness for POS, not words

textstat_keyness in Quanteda is used to compare the relative frequency of WORDS/LEMMAS in two (sub)corpora. But I want to compare parts of speech--not words. Then I want to plot it. I've been able to use textstat_keyness for words, no problem, using…
dfayers
  • 35
  • 4
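For the POS question above, a sketch under the assumption that spacyr is available for tagging: replace each word with its POS tag, rebuild tokens from the tags, and run textstat_keyness() on the resulting dfm ('corp' and 'target_docs' are placeholders):

    library(quanteda)
    library(spacyr)                                          # needs a working spaCy backend

    parsed <- spacy_parse(corp, pos = TRUE)
    pos_toks <- as.tokens(split(parsed$pos, parsed$doc_id))  # one stream of POS tags per document
    dfmat_pos <- dfm(pos_toks)
    textstat_keyness(dfmat_pos, target = docnames(dfmat_pos) %in% target_docs)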
2 votes, 1 answer

Why is Quanteda not removing words?

I am having trouble removing profanities from my n-grams. The getProfanityWords function below correctly creates a character vector. The whole script works in every other way, but the profanities remain. I did wonder whether it was to do with the…
Chris
  • 1,449
  • 1
  • 18
  • 39
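For the removal question above, a sketch of the usual fix: remove the profanity list at the tokens stage, before the n-grams are formed, and match case-insensitively ('toks' stands for the asker's tokens object; getProfanityWords() is the function from the question):

    toks <- tokens_remove(toks, pattern = getProfanityWords(), case_insensitive = TRUE)
    ngrams <- tokens_ngrams(toks, n = 2:3)   # build the n-grams only after the removal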
2 votes, 1 answer

Interpretation of dfm_weight(scheme='prop') with groups (quanteda)

I'm looking at the different weighting options using the dfm_weight. If I select scheme = 'prop' and I group textstat_frequency by location, what's the proper interpretation of a word in each group? Say in New York the term career is 0.6 and in…
Ted Mosby
  • 1,426
  • 1
  • 16
  • 41
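For the weighting question above, a sketch that makes the interpretation explicit: with scheme = "prop" each count becomes a share of its own document; for shares within a location, group the counts first and then weight, so each location's row sums to 1 ('dfmat' and the 'location' docvar are assumptions):

    dfmat_loc <- dfm_group(dfmat, groups = docvars(dfmat, "location"))
    dfm_weight(dfmat_loc, scheme = "prop")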
2 votes, 1 answer

Quanteda R: How to remove numbers or symbols "from"/"in" a token?

I have a question regarding the language pre-processing in quanteda R. I want to generate a document-feature matrix based on some documents. So, I generated a corpus and ran the following code. data <- read.csv2("abstract.csv", stringsAsFactors =…
Hu_Ca
  • 47
  • 1
  • 5
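For the preprocessing question above, a sketch: remove_numbers and remove_symbols only drop tokens that consist entirely of digits or symbols, so stripping digits inside a token needs the token types edited directly (the 'abstract' column name and the desired cleaned form are assumptions):

    toks <- tokens(data$abstract, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
    toks <- tokens_replace(toks, types(toks),
                           gsub("[0-9]+", "", types(toks)), valuetype = "fixed")
    toks <- tokens_select(toks, pattern = "*", min_nchar = 1)   # drop any types that became empty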