Questions tagged [quanteda]

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda includes tools to manipulate the texts in a corpus quickly and easily, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
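
As a rough illustration of the workflow described above, the following sketch builds a small corpus with one document-level variable, tokenizes it with and without stopwords and stemming, and segments it into sentences. The texts, the `source` variable, and all object names are made up for illustration, and the function names assume a recent quanteda release (version 3 or later).

```r
library(quanteda)

# Two toy documents with one document-level variable (hypothetical data).
txt <- c(doc1 = "The cat sat on the mat. It was happy.",
         doc2 = "Dogs are running in the park.")
corp <- corpus(txt, docvars = data.frame(source = c("a", "b")))

# Tokenize, dropping punctuation.
toks <- tokens(corp, remove_punct = TRUE)

# Optionally remove English stopwords and stem the remaining tokens.
toks_nostop <- tokens_remove(toks, stopwords("en"))
toks_stem <- tokens_wordstem(toks_nostop)

# Segment the corpus so that each sentence becomes its own document.
corp_sent <- corpus_reshape(corp, to = "sentences")
ndoc(corp_sent)
```

The document-level variable travels with the corpus, so `docvars(corp, "source")` should return the values supplied at creation.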

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package, which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units. This facilitates the computation of confidence intervals on textual statistics using non-parametric bootstrapping, applied to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
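
A minimal sketch of dictionary-based features and collocations, assuming a recent quanteda release: the sentence, the dictionary keys (`economy`, `place`), and the object names are hypothetical and only serve to show the idea.

```r
library(quanteda)

# Toy text (hypothetical data for illustration only).
toks <- tokens("Taxes in New York are high but services are good.",
               remove_punct = TRUE)

# Declare a collocation so that it is treated as a single token.
toks_comp <- tokens_compound(toks, pattern = phrase("New York"))

# Hypothetical dictionary; multi-word values are matched as token sequences.
dict <- dictionary(list(economy = c("taxes", "services"),
                        place   = "new york"))

# Replace matching tokens with their dictionary key
# (matching is case-insensitive by default).
toks_dict <- tokens_lookup(toks, dictionary = dict)

# Count the dictionary categories in a document-feature matrix.
dfmat_dict <- dfm(toks_dict)
```

Here `tokens_compound()` joins the two words with the default concatenator `_`, while `tokens_lookup()` reduces the text to counts of dictionary keys.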

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
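
The conversion to a dfm might look like the sketch below, again assuming a recent quanteda release. The two toy documents and the object names are invented for illustration; the final step only shows that a dfm can be handed to other R tools as an ordinary matrix.

```r
library(quanteda)

# Two toy documents (hypothetical data).
txt <- c(d1 = "apple banana apple", d2 = "banana cherry")

# Tokenize and build the document-feature matrix.
dfmat <- dfm(tokens(txt))

# Most frequent features across all documents.
topfeatures(dfmat, n = 2)

# Convert raw counts to within-document proportions.
dfmat_prop <- dfm_weight(dfmat, scheme = "prop")

# Coerce to a dense base-R matrix for use with other modelling functions.
m <- as.matrix(dfmat_prop)
```

Dense coercion is only sensible for small matrices; for larger corpora the sparse dfm is usually kept as-is.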

627 questions
0 votes, 0 answers

how to check if readtext fails to read part of a file

I am reading a text file with readtext(). It seems to be encoded in UTF-8 (according to Notepad++; I am unable to verify this), but I am not sure whether it is encoded correctly or whether there are some mistakes/corruption. File size on disk according to Windows…
user778806
  • 67
  • 6
0 votes, 1 answer

how to read text files in quanteda, storing each line as a document

I have texts stored in several files. Within the files, each line is a document (text of a blog post, text of a tweet, etc.). If I read them using the readtext package in the default way shown in doc/examples, the content of each file will be a single…
user778806
  • 67
  • 6
0 votes, 1 answer

R: Quanteda: can I use textstat_keyness on two separate corpora?

The usage of "textstat_keyness" is the following: textstat_keyness(x, target = 1L, measure = c("chi2", "exact", "lr", "pmi"), sort = TRUE, correction = c("default", "yates", "williams", "none")) "target" is "the document index (numeric,…
Marina Santini
  • 99
  • 1
  • 3
  • 12
0 votes, 0 answers

Using the French ANEW dictionary for sentiment analysis

Similarly to this post, I'm trying to use the Affective Norms for English Words (in French) for a sentiment analysis with Quanteda. I ultimately want to create a "mean sentiment" per text in my corpus. First, I load in the ANEW dictionary (FAN in…
Tristan G
  • 1
  • 2
0 votes, 2 answers

Download multiple txt files R

I want to download a number of .txt files. I have a data frame "New_test" in which the URLs are under 'url' and the destination names under 'code'. "New_test.txt" "url" "code" "1"…
Mel Schickel
  • 47
  • 1
  • 8
0 votes, 0 answers

"MV" rescaling not working

I am trying to use wordscores on a corpus, but when I use the "mv" rescaling the code fails to set the texts I have selected as reference texts. Moreover, even though I set -1 and 1 as reference values, the rescaled scores go beyond them. It…
Ion
  • 1
  • 1
0 votes, 1 answer

Pairwise Distance between documents

I am trying to calculate the similarity of the rows of one document-term matrix with the rows of another document-term matrix. A <- data.frame(name = c( "X-ray right leg arteries", "x-ray left shoulder", "x-ray leg arteries", "x-ray leg with 20km…
john
  • 1,026
  • 8
  • 19
0 votes, 1 answer

Customized stopword list remove

I am trying to use a customized word list to remove phrases from text. This is a reproducible example. I think something is not right with my attempt: mystop <- structure(list(stopwords = c("remove", "this line", "remove this line", "two lines")),…
user8831872
  • 383
  • 1
  • 14
0 votes, 1 answer

couldn't install quanteda either directly or via source

I've tried to install the package directly, from its GitHub version, and from source, all to no avail. This is the error message: During startup - Warning messages: 1: Setting LC_CTYPE failed, using "C" 2: Setting LC_TIME failed, using "C" 3: Setting…
santoku
  • 3,297
  • 7
  • 48
  • 76
0 votes, 1 answer

Convert dfm to DocumentTermMatrix

Having a dataframe like this: df <- structure(list(text = c("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus…
HelenVcl
  • 59
  • 1
  • 1
  • 8
0 votes, 2 answers

ntokens applied to VCorpus

I execute the following commands: library(tm) library(dplyr) library(stringi) library(quanteda) df <- structure(list(text = c("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis…
HelenVcl
  • 59
  • 1
  • 1
  • 8
0 votes, 0 answers

R Regular expression to search citations of law using tidytext and tm

I use tidytext, tm and quanteda for text mining. I am trying to: (1) filter a tibble with plain, processed text according to the presence of a citation of law; (2) count the number of occurrences of the same citation per text document. Unfortunately, I am weak at using specific…
captcoma
  • 1,768
  • 13
  • 29
0 votes, 1 answer

R quanteda library, error in corpus creation

I have a curious error which only happens in my colleagues' RStudio when they run the code. The code deals with a text corpus, and this is what I do: ap.corpus <- corpus(raw.data$text) ap.corpus #Corpus consisting of 214,226 documents and 0…
Nat
  • 19
  • 5
0 votes, 1 answer

How do i fix "Error: could not find function "tokens"" in R (in RStudio)?

While learning R, I am asked to use the package "quanteda" and apply the function "tokens". Unfortunately, when I try to do so, I get the message Error: could not find function "tokens". But I can use, for example, "tokenize". My code is: …
Marko Karbevski
  • 137
  • 1
  • 12
0 votes, 1 answer

Dictionary different output than the trial site version

I am trying to use the 2015 version of the LIWC dictionary in R. A dummy text for text analysis: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes,…
cottinR
  • 181
  • 1
  • 1
  • 10