Questions tagged [quanteda]

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. It also includes tools for manipulating the texts in a corpus quickly, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
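
As a minimal sketch of the workflow described above, the following builds a small corpus with one document-level variable, tokenizes it, and reshapes it into sentences. The texts, document names, and the `source` variable are hypothetical, and the calls assume a recent quanteda release (version 3 or later).

```r
library(quanteda)

# Hypothetical toy texts; the names and the docvar are made up for illustration
txts <- c(doc1 = "The quick brown fox jumps. It runs away.",
          doc2 = "Foxes are running quickly through the forest.")
corp <- corpus(txts, docvars = data.frame(source = c("a", "b")))

# Document-level variables are stored alongside the texts
docvars(corp)

# Tokenize, then drop punctuation and English stopwords, then stem
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)

# Re-segment the corpus so that each sentence becomes its own document
sents <- corpus_reshape(corp, to = "sentences")
ndoc(sents)
```

Stopword removal and stemming are separate, optional steps here, so either can be dropped without changing the rest of the pipeline.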

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package, which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units. This facilitates the computation of confidence intervals on textual statistics using non-parametric bootstrapping, applied to the original texts as data. quanteda also includes a suite of tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
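
The dictionary-based feature extraction mentioned above might look roughly like this. It is a sketch only: the texts and the two-key dictionary are hypothetical, and it assumes quanteda version 3 or later.

```r
library(quanteda)

# Hypothetical toy texts
txts <- c(d1 = "Taxes and spending rose in the United States.",
          d2 = "Schools and hospitals need more spending.")
toks <- tokens(txts, remove_punct = TRUE)

# Declare a multi-word expression so it is treated as a single feature
toks_comp <- tokens_compound(toks, phrase("United States"))

# Hypothetical dictionary with two keys and glob-style patterns
dict <- dictionary(list(economy  = c("tax*", "spending"),
                        services = c("school*", "hospital*")))

# Replace matching tokens with their dictionary keys, then count per document
dfmat <- dfm(tokens_lookup(toks_comp, dict))
dfmat
```

After `tokens_lookup()`, the columns of the resulting matrix are the dictionary keys rather than the original words.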

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
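
To illustrate the descriptive end of that analysis, here is a small sketch using only functions from the core package. The texts are hypothetical. In recent releases, similarity measures and models are distributed in companion packages such as quanteda.textstats and quanteda.textmodels, so they are left out here.

```r
library(quanteda)

# Hypothetical toy documents
txts <- c(d1 = "apple banana banana cherry",
          d2 = "banana cherry cherry date")
dfmat <- dfm(tokens(txts))

# Most frequent features across the whole matrix
topfeatures(dfmat)

# Convert counts to within-document proportions
dfmat_prop <- dfm_weight(dfmat, scheme = "prop")

# Export to a plain data.frame for use with other modelling tools
df <- convert(dfmat, to = "data.frame")
```

The exported data frame carries a `doc_id` column followed by one column per feature, which is a convenient shape for tools that do not understand sparse matrices.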

627 questions
0
votes
1 answer

Computing cosine similarities on a large corpus in R using quanteda

I am trying to work with a very large corpus of about 85,000 tweets that I'm trying to compare to dialog from television commercials. However, due to the size of my corpus, I am unable to process the cosine similarity measure without getting the…
ModalBro
  • 544
  • 5
  • 25
0
votes
1 answer

R Text Mining with quanteda

I have a data set (Facebook posts) (via netvizz) and I use the quanteda package in R. Here is my R code. # Load the relevant dictionary (relevant for analysis) liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC") # Read File #…
Daniel
  • 137
  • 2
  • 9
0
votes
3 answers

How to create wordclouds for text files in a directory in R

I am trying to create a wordcloud for each text file in a directory. They are four presidential announcement speeches. I keep getting the following message: > cname <- file.path("C:", "texts") > cname [1] "C:/texts" > cname <-…
-1
votes
1 answer

How to REMOVE lower case tokens with R?

I'm using R/quanteda and I'm trying to make a wordcloud from ONLY upper case words. The text is from a bibliographic reference in ABNT format, so doing this would keep only the authors' surnames. Any hints? Thanks!
-1
votes
1 answer

how to find the element in between two elements in a character vector created by an rtf document

I have an object created from an rtf document using the code: sample_doc <- read_rtf("sample.doc") (I had to use read_rtf because the document is actually an rtf). I know somewhere in the document there are two phrases (an element in the character…
sli1991
  • 11
  • 2
-1
votes
1 answer

R: Applying quanteda's textstat_readability function producing "Error in set"

I am encountering an issue with applying the textstat_readability function to a DF column. Following several lines of cleaning tweet text (~ 53K observations), I apply the textstat_readability function to create a new column called $Flesch from the…
-1
votes
1 answer

kwic in quanteda (R) does not identify more than one word in regex pattern

I am trying to identify regex patterns in text, but kwic() does not identify regex phrases that are longer than just one word. I tried to use phrase(), but that did not work either. To give you an example: mycorpus = corpus(bla$`TEXT` ) foo =…
Sherls
  • 31
  • 3
-1
votes
1 answer

R - convert DFM to LSA then compute cosine similarity: Error inherits(x, "Matrix") is not TRUE

I have a document-feature matrix (DFM). I want to convert it into an LSA object and then compute the cosine similarity between each pair of documents. These are the steps I followed: lsa_t2 <- convert(DFM_tfidf, to = "lsa" , omit_empty =…
Carbo
  • 906
  • 5
  • 23
-1
votes
1 answer

how is PcGw computed in quanteda's Naive Bayes?

Consider the usual example that replicates the example from 13.1 of An Introduction to Information Retrieval https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf txt <- c(d1 = "Chinese Beijing Chinese", d2 = "Chinese Chinese Shanghai", …
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
-1
votes
1 answer

Text Similarity - Cosine - Control

I would like to ask if anybody could check my code, because it was behaving weirdly: it went from not working and giving me errors to suddenly working without my changing anything. The code is at the bottom. Background: So my goal is to calculate text…
-1
votes
2 answers

How do I attach metadata to a text corpus with quanteda?

I am using quanteda to create a text corpus and trying to attach metadata, but I keep getting an error. I have used this code before on another dataset, but for some reason it's not working with my current dataset. The code is: dfm.ineq1 <-…
tlev
  • 83
  • 9
-2
votes
1 answer

Converting quanteda dfmSparse matrix->data.frame->h2o adds unwanted initial row of NaNs

I have a 10025x1417 TFIDF dfm matrix created with quanteda. (The actual class is dfmSparse which is a subclass of dfm-matrix). When I convert to h2o with as.data.frame and then as.h2o, I incorrectly get 10026x1417, with an unwanted extra first row…
smci
  • 32,567
  • 20
  • 113
  • 146