Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or to segment them by sentence or paragraph units.
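A minimal sketch of these corpus operations, assuming a recent quanteda release (3.x or later). The toy texts and the `party` document variable are hypothetical and only serve to illustrate the calls:

```r
library(quanteda)

# Hypothetical toy texts with one document-level variable
txt <- c(speech1 = "We will cut taxes. We will build schools.",
         speech2 = "Taxes are too low! Schools need funding.")
corp <- corpus(txt, docvars = data.frame(party = c("A", "B")))

# Segment the corpus into sentence units; docvars are carried over
sents <- corpus_reshape(corp, to = "sentences")

# Tokenize, then drop English stopwords and stem what remains
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)

print(ndoc(sents))
print(toks)
```

Stopword removal and stemming are separate, optional steps here, so either can be left out to tokenize "with or without" them.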

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is built on the stringi package, which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data. quanteda includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
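As an illustration of dictionary-defined features and multi-word expressions, the following sketch assumes a recent quanteda release (3.x or later). The sentence, the dictionary key `methods`, and its phrases are made up for the example:

```r
library(quanteda)

toks <- tokens("Machine learning and data mining both rely on text data.",
               remove_punct = TRUE)

# Hypothetical dictionary: one key whose values are multi-word phrases
dict <- dictionary(list(methods = c("machine learning", "data mining")))

# Option 1: replace matched phrases with their dictionary key
m_dict <- dfm(tokens_lookup(toks, dictionary = dict))

# Option 2: join each phrase into a single token, e.g. "machine_learning"
toks_comp <- tokens_compound(
  toks, pattern = phrase(c("machine learning", "data mining"))
)
m_comp <- dfm(toks_comp)

print(m_dict)
print(featnames(m_comp))
```

The first option counts occurrences per dictionary key; the second keeps every token but treats each listed phrase as one feature.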

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
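A small sketch of building and inspecting a dfm, assuming a recent quanteda release (3.x or later). The two documents are hypothetical:

```r
library(quanteda)

# Hypothetical toy documents
toks <- tokens(c(d1 = "Text analysis of text data",
                 d2 = "Data about data"),
               remove_punct = TRUE)

# Documents in rows, features in columns; dfm() lowercases by default
m <- dfm(toks)

# Most frequent features across the whole matrix
print(topfeatures(m))

# Convert counts to within-document proportions before comparing texts
m_prop <- dfm_weight(m, scheme = "prop")
print(m_prop)
```

Weighting by proportion is one common choice when documents differ in length; other schemes may suit other analyses better.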

Resources

CRAN page
Source code on GitHub (including the latest version in the dev branch)

627 questions

votes

1 answer

Manipulate (rename and recombine) features in a dfm (quanteda)

I would like to manipulate (rename and combine) features in a dfm. How should I proceed? The reason is as follows: I want to use a different stemming algorithm than the Porter stemmer implemented in Quanteda (the kpss algorithm called via Python).…

r quanteda

asked Mar 22 '17 at 22:04

pmkruyen

votes

1 answer

QUANTEDA - invalid class “dfmSparse” object

I get this warning-message. I use these data: https://github.com/kbenoit/quanteda/tree/master/data/data_char_inaugural.RData RStudio version: Version 1.0.136 – © 2009-2016 RStudio, Inc. library(quanteda) uk2010immigCorpus <-…

r quanteda

asked Feb 03 '17 at 13:50

majesus

votes

1 answer

creating a dfm of words with letters

I am trying to create a dfm of letters from strings. I am facing issues because the dfm is unable to pick up or create features for punctuation such as "/" "-" "." or '. require(quanteda) dict = c('a','b','c','d','e','f','/',".",'-',"'") dict <-…

r sapply quanteda dfm

asked Nov 20 '16 at 02:10

SuperSatya

votes

1 answer

Change the length of ContextPre and ContextPost in Quanteda KWIC

Is there a way to increase the number of words appearing before and after the keyword in Quanteda kwic function? I've tried by changing the numeric value in: options(width = 200) but it didn't work. @KenBenoit

r text-mining quanteda

asked May 25 '16 at 11:08

DebNa

votes

1 answer

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes algorithm that predicts the category (i.e. business, entertainment) of an article based on type. I'm attempting this with Quanteda and have…

r quanteda

asked May 02 '16 at 03:56

Matt

votes

1 answer

R construct document term matrix how to match dictionaries whose values consist of white-space separated phrases

When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But, similar to Chinese, English also has certain phrases, such as "semantic distance", "machine learning", if you…

r dictionary text-mining term-document-matrix quanteda

asked Apr 20 '16 at 02:18

Fiona_Wang

votes

1 answer

How to stem all words in an ngram, using quanteda?

I'm working with the Quanteda package in R at the moment, and I'd like to calculate the ngrams of a set of stemmed words to get a quick-and-dirty estimate of what content words tend to be near each other. If I try: twitter.files <-…

r nlp n-gram stemming quanteda

asked Apr 09 '16 at 16:03

Michael Anderson

votes

2 answers

How to keep the beginning and end of sentence markers with quanteda

I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep in the n-grams beginning and end of sentence markers, the `<s>` and `</s>` as in the code below. I thought that using the keptFeatures with a regular…

r nlp text-mining tm quanteda

asked Mar 30 '16 at 23:33

Giuseppe Romagnuolo


votes

1 answer

Quanteda with topicmodels: removed stopwords appear in results (Chinese)

My code: library(quanteda) library(topicmodels) # Some raw text as a vector postText <- c("普京称俄罗斯未乌克兰施压来自头条新闻", "长期电脑前进食致癌环球网报道乌克兰学者认为电脑前进食会引发癌症等病症电磁辐射作用电脑旁水食物会逐渐变质有害物质累积尽管人体短期内会感到适会渐渐引发出癌症阿尔茨海默式…

r topic-modeling topicmodels quanteda

asked Mar 24 '16 at 21:16

Jackson-MSFT

votes

1 answer

Using dictionary to create Bigram in Quanteda

I am trying to remove typos from my data for text analysis, so I am using the dictionary feature of the quanteda package. It works fine for unigrams, but it gives unexpected output for bigrams. Not sure how to handle typos so that they do not sneak into my…

r quanteda

asked Dec 26 '15 at 19:41

PeterV

votes

1 answer

Form bigrams without stopwords in R

I have recently had some trouble with bigrams in text mining using R. The purpose is to find meaningful keywords in news, for example "smart car" and "data mining". Let's say I have a string as follows: "IBM have a great success in the…

r text-mining tm n-gram quanteda

asked Dec 15 '15 at 06:22

John Chou

votes

1 answer

Import lexisnexis output into R quanteda

I would like to use Benoit's R package quanteda to analyze articles exported from lexisnexis. The export is in the standard html-format. I use the tm package + plugin to read the lexisnexis output. Unfortunately, an error occurs when transforming the…

r tm quanteda

asked Dec 08 '15 at 21:04

bstn

votes

2 answers

R tm Package: How to compare text to positive reference word list and return count of positive word occurrences

What is the best approach to use the tm library to compare text to a positive reference word list and return a count of positive word occurrences? I want to be able to return the sum of positive words in reference text. Question: What is the best way to…

r tm quanteda

asked Nov 21 '15 at 05:38

Technophobe01


vote

1 answer

Backtransform word tokens to a sentence-based corpus in Quanteda after preprocessing

I want to preprocess my text data using the {quanteda} package in R. To do so, I am creating a corpus, which is then tokenized and preprocessed (e.g. lowercase, remove punctuation, etc.). Ideally, I would then want to restore the initial sentence…

r quanteda data-preprocessing

asked Aug 19 '23 at 15:15

Dr. Fabian Habersack


vote

1 answer

[readtext]: download files from the Internet to remove text via stringi and read the file into Quanteda

My aim is to read multiple text files into Quanteda, first removing unwanted text that is contained within # marks. Stringi code has been provided to perform this task, however, problems were encountered reading the file in Quanteda, regarding the…

quanteda stringi

asked Jul 26 '23 at 09:55

bgreen
