Questions tagged [quanteda]

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda also includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is built on the stringi package, which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics using non-parametric bootstrapping techniques applied to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.

Resources

CRAN page
Source code on GitHub (including the latest version in the dev branch)

627 questions

0 answers

how to check if readtext fails to read part of a file

I am reading a text file with readtext(). It seems to be encoded in UTF-8 (according to Notepad++; I am unable to verify this). I am not sure whether it is encoded correctly or whether there are some mistakes/corruption. File size on disk according to Windows…

asked Apr 15 '18 at 17:10

user778806

1 answer

how to read text files in quanteda, storing each line as a document

I have texts stored in several files. Within the files, each line is a document (text of a blog post, text of a tweet, etc.). If I read them using the readtext package in the default way shown in doc/examples, the content of each file will be a single…

r nlp quanteda

asked Apr 07 '18 at 10:56

user778806

1 answer

R: Quanteda: can I use textstat_keyness on two separate corpora?

The usage of "textstat_keyness" is the following: textstat_keyness(x, target = 1L, measure = c("chi2", "exact", "lr", "pmi"), sort = TRUE, correction = c("default", "yates", "williams", "none")) "target" is "the document index (numeric,…

r quanteda

asked Apr 05 '18 at 08:30

Marina Santini

0 answers

Using the French ANEW dictionary for sentiment analysis

Similarly to this post, I'm trying to use the Affective Norms for English Words (in French) for a sentiment analysis with Quanteda. I ultimately want to create a "mean sentiment" per text in my corpus. First, I load in the ANEW dictionary (FAN in…

r quanteda

asked Mar 24 '18 at 17:02

Tristan G

2 answers

Download multiple txt files R

I want to download a number of .txt files. I have a data frame "New_test" in which the URLs are under 'url' and the destination names under 'code'. "New_test.txt" "url" "code" "1"…

r download quanteda

asked Mar 20 '18 at 15:04

Mel Schickel

0 answers

"MV" rescaling not working

I am trying to use wordscores on a corpus, but when I use the "mv" rescaling the code fails to set the texts I have selected as reference texts. Besides, even though I establish -1 and 1 as reference values, it goes beyond them when rescaling. It…

r quanteda

asked Feb 27 '18 at 15:47

Ion

1 answer

Pairwise Distance between documents

I am trying to calculate similarity of rows of one document term matrix with rows of another document term matrix. A <- data.frame(name = c( "X-ray right leg arteries", "x-ray left shoulder", "x-ray leg arteries", "x-ray leg with 20km…

r quanteda

asked Feb 17 '18 at 19:22

john

1 answer

Customized stopword list remove

I am trying to use a customized word list to remove phrases from text. This is a reproducible example. I think something is not right with my attempt: mystop <- structure(list(stopwords = c("remove", "this line", "remove this line", "two lines")),…

r quanteda

asked Feb 01 '18 at 21:11

user8831872

1 answer

couldn't install quanteda either directly or via source

I've tried to install the package directly, from its GitHub version, and from source, all to no avail. This is the error message: During startup - Warning messages: 1: Setting LC_CTYPE failed, using "C" 2: Setting LC_TIME failed, using "C" 3: Setting…

r text nlp quanteda

asked Jan 31 '18 at 08:34

santoku

1 answer

Convert dfm to DocumentTermMatrix

Having a dataframe like this: df <- structure(list(text = c("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus…

r quanteda

asked Jan 28 '18 at 12:08

HelenVcl

2 answers

ntokens applied to VCorpus

I execute the following commands: library(tm) library(dplyr) library(stringi) library(quanteda) df <- structure(list(text = c("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis…

r tm quanteda

asked Jan 27 '18 at 11:03

HelenVcl

0 answers

R Regular expression to search citations of law using tidytext and tm

I use tidytext, tm and quanteda for text mining. I am trying to: (1) filter a tibble with plain, processed text according to the presence of a citation of law; (2) count the number of occurrences of the same citation per text document. Unfortunately, I am weak at using specific…

r regex tm quanteda tidytext

asked Jan 13 '18 at 20:10

captcoma

1 answer

R quanteda library, error in corpus creation

I have a curious error which only happens in my colleagues' RStudio when they run the code. The code deals with a text corpus, and this is what I do: ap.corpus <- corpus(raw.data$text) ap.corpus #Corpus consisting of 214,226 documents and 0…

r corpus quanteda

asked Jan 05 '18 at 16:20

Nat

1 answer

How do I fix "Error: could not find function "tokens"" in R (in RStudio)?

While learning R, I am asked to use the package "quanteda" and apply the function "tokens". Unfortunately, when I try to do so, I get the message Error: could not find function "tokens". But I can use, for example, "tokenize". My code is: …

r quanteda

asked Dec 22 '17 at 12:52

Marko Karbevski

1 answer

Dictionary different output than the trial site version

I am trying to use the LIWC dictionary (2015 version) in R. A dummy text for text analysis: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes,…

r quanteda

asked Dec 03 '17 at 11:03

cottinR
