Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package, which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics using non-parametric bootstrapping techniques applied to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
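
The workflow described above (corpus, then tokens, then dfm, then dictionary features) might look roughly like the following. This is a minimal sketch, assuming a recent quanteda release (version 1.0 or later); older releases used different function names. The two texts, the `year` document variable, and the dictionary keys are invented purely for illustration.

```r
library(quanteda)

# Two toy documents; the texts and the docvar "year" are made up for illustration.
txts <- c(doc1 = "Spring rains fall on the quiet hills.",
          doc2 = "Summer heat dries the rivers and the hills.")
corp <- corpus(txts, docvars = data.frame(year = c(2015, 2016)))

# Tokenize, drop punctuation and English stopwords, then stem.
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)

# Build the document-feature matrix (features are lowercased by default).
dfmat <- dfm(toks)

# Dictionary-based features: count matches per dictionary key.
# The keys and glob patterns here are hypothetical.
dict <- dictionary(list(season  = c("spring", "summer"),
                        terrain = c("hill*", "river*")))
dfmat_dict <- dfm_lookup(dfmat, dict)
print(dfmat_dict)
```

The resulting object has one row per document and one column per dictionary key, and can be passed on to other quantitative methods.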

627 questions
0
votes
1 answer

I can't remove • and some other special characters such as '- using tm_map

I searched through the questions and was able to replace • in my first set of commands. But when I apply it to my corpus, it doesn't work: the • still appears. The corpus has 6570 elements, 2.3mb, so it seems to be valid. > x <- ". R Tutorial" >…
Etalo
  • 11
  • 2
0
votes
1 answer

R: initialise empty dgCMatrix given by matrix multiplication of two Quanteda DFM sparse matrices?

I have a for loop like this, trying to implement the solution here, with dummy vars such that aaa <- DFM %*% t(DFM) #DFM is Quanteda dfm-sparse-matrix for(i in 1:nrow(aaa)) aaa[i,] <- aaa[i,][order(aaa[i,], decreasing = TRUE)] but now for(i in…
hhh
  • 50,788
  • 62
  • 179
  • 282
0
votes
2 answers

R: sparse matrix multiplication with data.table and quanteda package?

I am trying to perform a matrix multiplication with a sparse matrix using the quanteda package together with the data.table package, related to this thread here. So require(quanteda) mytext <- c("Let the big dogs hunt", "No holds barred", "My…
hhh
  • 50,788
  • 62
  • 179
  • 282
0
votes
0 answers

Quanteda Textfile Twitter JSON Error Reading

I am trying to use Quanteda's textfile wrapper to read in the JSON at the following link: My code is the following: textfile("20070101-20080214_ehfdpezgqg_2007_01_01_00_00_activities.json", textField = "body") But when I run this I obtain the…
mlachans
  • 49
  • 8
0
votes
1 answer

Quanteda - Apply Function to DFM Over Document Variables

I am using R's quanteda package and the latest versions for both R and the package. I have a corpus of documents which number in the millions. Let's suppose I have a DFM generated from quanteda with each document having a docvar of the date. There…
mlachans
  • 49
  • 8
0
votes
1 answer

R How to use maxCount scheme in Quanteda package

My question is simple: the Quanteda package in R has a function for calculating the term frequency (tf) of a document-feature matrix (dfm). When you look at the description of the tf function with ?tf, it says it has four arguments. My question is…
csmontt
  • 614
  • 8
  • 15
0
votes
1 answer

Quanteda - Extracting identified dictionary words

I am trying to extract the identified dictionary words from a Quanteda dfm, but have been unable to find a solution. Does someone have a solution for this? Sample input: dict <- dictionary(list(season = c("spring", "summer", "fall",…
0
votes
1 answer

quanteda not creating corpus from corpusSource object

I am using windows 7 with a 32-bit operating system with 4Gb RAM of which only 3Gb is accessible due to 32-bit limitations. I shut everything else down and can see that I have about 1Gb as cached and 1Gb available before starting. The "free" memory…
0
votes
1 answer

Seeding words into an LDA topic model in R

I have a dataset of news articles that have been collected based on the criteria that they use the term "euroscepticism" or "eurosceptic". I have been running topic models using the lda package (with dfm matrices built in quanteda) in order to…
0
votes
1 answer

"Invalid class “dfmSparse” object" error when running dfm function in quanteda R package

I'm using quanteda, an R package for managing and analyzing text. I am running into trouble with one of its core functions: "dfm", which is used for constructing a document-feature matrix. Running the function # Install packages packages <-…
zxtonizx
  • 11
  • 1
0
votes
2 answers

Implementing N-grams in my corpus, Quanteda Error

I am trying to implement quanteda on my corpus in R, but I am getting: Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, : duplicate row.names: character(0) I don't have much experience with this. Here is a download of the…
gamelanguage
  • 103
  • 10
0
votes
2 answers

Computing n-grams on large corpus using R and Quanteda

I am trying to build n-grams from a large corpus (object size about 1Gb in R) of text using the great Quanteda package. I don't have a cloud resource available, so I am using my own laptop (Windows and/or Mac, 12Gb RAM) to do the computation. If I…
Federico
  • 76
  • 7
0
votes
1 answer

R in Windows cannot handle some characters

I performed LDA in Linux and didn't get characters like "ø" in topic 2. However, when run in Windows, they show. Does anyone know how to deal with this? I used packages quanteda and topicmodels. > terms(LDAModel1,5) Topic 1 Topic 2 [1,] "car" …
user1569341
  • 333
  • 1
  • 6
  • 17
0
votes
1 answer

quanteda ngram works with mac but breaks in windows 7

I have a set of texts that I am processing for the Johns Hopkins Capstone project. I am using quanteda as my core text handling library. I work on my Macbook Pro at home and a Windows 7 64-bit machine at work. My R script appears to run correctly on my…
0
votes
1 answer

Error using NB model in textmodel() of quanteda package

I am trying to fit a model to a dfm I created using quanteda. I am getting the following error. Any ideas? tModel <- textmodel(udfm1,model = "NB", smooth=1) Error in textmodel(udfm1, model = "NB", smooth = 1) : model NB not implemented. p.s. I am…
PeterV
  • 195
  • 1
  • 13