Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package, which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data. quanteda includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
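The corpus-to-dfm workflow described above might look roughly like the following. This is only a minimal sketch, assuming a recent quanteda release in which `corpus()`, `tokens()`, `tokens_remove()`, `tokens_wordstem()` and `dfm()` are available; older releases (such as those current when many of the questions below were asked) used different function names and arguments. The two example texts and the `year` document variable are made up for illustration.

```r
library(quanteda)

# Two toy documents (hypothetical, for illustration only)
txts <- c(doc1 = "The quick brown fox jumps over the lazy dog.",
          doc2 = "Dogs and foxes are running in the fields.")

# Build a corpus and attach a document-level variable
corp <- corpus(txts)
docvars(corp, "year") <- c(2016, 2017)

# Tokenize, then drop English stopwords and stem the remaining tokens
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)

# Convert the tokens into a document-feature matrix (dfm)
mat <- dfm(toks)
print(mat)

# Most frequent features across both documents
print(topfeatures(mat, 5))
```

The resulting `mat` is a sparse matrix with one row per document and one column per feature, which is the object most of the questions below operate on.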

Resources

CRAN page
Source code on GitHub (including the latest version in the dev branch)

627 questions

1 answer

I can't remove • and some other special characters such as '- using tm_map

I searched through the questions and was able to replace • in my first set of commands. But when I apply it to my corpus, it doesn't work; the • still appears. The corpus has 6570 elements, 2.3 MB, so it seems to be valid. > x <- ". R Tutorial" >…

r gsub tm quanteda

asked Mar 22 '17 at 03:29

Etalo

1 answer

R: initialise empty dgCMatrix given by matrix multiplication of two Quanteda DFM sparse matrices?

I have a for loop like this, trying to implement the solution here, with dummy variables such that aaa <- DFM %*% t(DFM) #DFM is Quanteda dfm-sparse-matrix for(i in 1:nrow(aaa)) aaa[i,] <- aaa[i,][order(aaa[i,], decreasing = TRUE)] but now for(i in…

r initialization sparse-matrix matrix-multiplication quanteda

asked Jan 11 '17 at 16:46

hhh

2 answers

R: sparse matrix multiplication with data.table and quanteda package?

I am trying to perform a matrix multiplication on a sparse matrix using the quanteda package together with the data.table package, related to this thread here. So require(quanteda) mytext <- c("Let the big dogs hunt", "No holds barred", "My…

r matrix data.table sparse-matrix quanteda

asked Jan 09 '17 at 15:22

hhh

0 answers

Quanteda Textfile Twitter JSON Error Reading

I am trying to use Quanteda's textfile wrapper to read in the JSON at the following link: My code is the following: textfile("20070101-20080214_ehfdpezgqg_2007_01_01_00_00_activities.json", textField = "body") But when I run this I obtain the…

json r text quanteda

asked Dec 22 '16 at 04:10

mlachans

1 answer

Quanteda - Apply Function to DFM Over Document Variables

I am using R's quanteda package and the latest versions for both R and the package. I have a corpus of documents which number in the millions. Let's suppose I have a DFM generated from quanteda with each document having a docvar of the date. There…

r quanteda

asked Nov 29 '16 at 02:28

mlachans

1 answer

R How to use maxCount scheme in Quanteda package

My question is simple: the Quanteda package in R has a function for calculating the term frequency (tf) of a document-feature matrix (dfm). When you look at the description of the tf function with ?tf, it says it has four arguments. My question is…

r tf-idf quanteda

asked Oct 14 '16 at 03:05

csmontt

1 answer

Quanteda - Extracting identified dictionary words

I am trying to extract the identified dictionary words from a Quanteda dfm, but have been unable to find a solution. Does someone have a solution for this? Sample input: dict <- dictionary(list(season = c("spring", "summer", "fall",…

r text-mining quanteda

asked Sep 28 '16 at 11:38

Frederik Andersen

1 answer

quanteda not creating corpus from corpusSource object

I am using Windows 7 with a 32-bit operating system and 4 GB of RAM, of which only 3 GB is accessible due to 32-bit limitations. I shut everything else down and can see that I have about 1 GB cached and 1 GB available before starting. The "free" memory…

r corpus quanteda

asked Aug 18 '16 at 20:08

B. McCracken

1 answer

Seeding words into an LDA topic model in R

I have a dataset of news articles that have been collected based on the criteria that they use the term "euroscepticism" or "eurosceptic". I have been running topic models using the lda package (with dfm matrices built in quanteda) in order to…

r lda quanteda topicmodels

asked Jun 09 '16 at 13:02

Michael Bossetta

1 answer

"Invalid class “dfmSparse” object" error when running dfm function in quanteda R package

I'm using quanteda, an R package for managing and analyzing text. I am running into trouble with one of its core functions, "dfm", which is used for constructing a document-feature matrix. Running the function # Install packages packages <-…

r text-analysis quanteda

asked Jun 08 '16 at 13:01

zxtonizx

2 answers

Implementing N-grams in my corpus, Quanteda Error

I am trying to implement quanteda on my corpus in R, but I am getting: Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, : duplicate row.names: character(0) I don't have much experience with this. Here is a download of the…

r text analytics n-gram quanteda

asked Apr 14 '16 at 06:25

gamelanguage

2 answers

Computing n-grams on large corpus using R and Quanteda

I am trying to build n-grams from a large corpus (object size about 1 GB in R) of text using the great Quanteda package. I don't have a cloud resource available, so I am using my own laptop (Windows and/or Mac, 12 GB RAM) to do the computation. If I…

r nlp out-of-memory quanteda

asked Mar 29 '16 at 12:32

Federico

1 answer

R in Windows cannot handle some characters

I performed LDA in Linux and didn't get characters like "ø" in topic 2. However, when run in Windows, they show. Does anyone know how to deal with this? I used packages quanteda and topicmodels. > terms(LDAModel1,5) Topic 1 Topic 2 [1,] "car" …

r windows lda topicmodels quanteda

asked Jan 13 '16 at 03:17

user1569341

1 answer

quanteda ngram works with mac but breaks in windows 7

I have a set of texts that I am processing for the Johns Hopkins Capstone project. I am using quanteda as my core text handling library. I work on my MacBook Pro at home and a Windows 7 64-bit machine at work. My R script appears to run correctly on my…

r windows macos quanteda

asked Jan 05 '16 at 18:44

Harold Trammel

1 answer

Error using NB model in textmodel() of quanteda package

I am trying to fit a model to a dfm I created using quanteda. I am getting the following error. Any ideas? tModel <- textmodel(udfm1,model = "NB", smooth=1) Error in textmodel(udfm1, model = "NB", smooth = 1) : model NB not implemented. P.S. I am…

r text-mining cross-validation quanteda

asked Dec 29 '15 at 00:06

PeterV
