Questions tagged [quanteda]

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
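
As a rough illustration of the corpus workflow described above, here is a minimal sketch. It assumes a recent quanteda release (version 3 or later); the two toy texts, the document names, and the `year` document variable are invented for the example and do not come from the package.

```r
library(quanteda)

# Two toy documents (hypothetical); `year` is a made-up document-level variable
txt <- c(doc_a = "Taxes are rising. Voters are worried about rising costs.",
         doc_b = "The committee debated the budget.")
corp <- corpus(txt, docvars = data.frame(year = c(2010, 2015)))

# Tokenize, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Optionally remove English stopwords, then stem what is left
toks_nostop <- tokens_remove(toks, stopwords("en"))
toks_stem <- tokens_wordstem(toks_nostop)

# Segment the corpus so that each sentence becomes its own document
corp_sent <- corpus_reshape(corp, to = "sentences")

print(docvars(corp))
print(toks_stem)
print(ndoc(corp_sent))
```

Stopword removal and stemming are separate, optional steps here, which matches the "with or without" wording above: skip either call to keep the tokens unmodified.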

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package, which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using non-parametric bootstrapping techniques applied to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
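
To make the idea of a document-feature matrix concrete, the following is a small sketch, again assuming quanteda version 3 or later. The toy texts and the dictionary key `ab_fruit` are hypothetical and chosen only for illustration.

```r
library(quanteda)

# Toy texts (hypothetical); names become the document names
txt <- c(d1 = "apple banana apple", d2 = "banana cherry")

# Build a dfm: one row per document, one column per feature, cells hold counts
dfmat <- dfm(tokens(txt))
print(dfmat)

# Most frequent features across all documents
print(topfeatures(dfmat))

# Define features by a dictionary: count matches for each key instead of each word
dict <- dictionary(list(ab_fruit = c("apple", "banana")))
dfmat_dict <- dfm_lookup(dfmat, dict)
print(dfmat_dict)
```

The resulting objects are sparse matrices, so they can be converted with `as.matrix()` when a downstream method expects a dense matrix, although that is only practical for small examples like this one.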

627 questions
2 votes, 1 answer

Manipulate (rename and recombine) features in a dfm (quanteda)

I would like to manipulate (rename and combine) features in a dfm. How should I proceed? The reason is as follows: I want to use a different stemming algorithm than the Porter stemmer implemented in quanteda (the kpss algorithm called via Python).…
pmkruyen
2 votes, 1 answer

QUANTEDA - invalid class “dfmSparse” object

I get this warning-message. I use these data: https://github.com/kbenoit/quanteda/tree/master/data/data_char_inaugural.RData RStudio version: Version 1.0.136 – © 2009-2016 RStudio, Inc. library(quanteda) uk2010immigCorpus <-…
majesus
2 votes, 1 answer

creating a dfm of words with letters

I am trying to create a dfm of letters from strings. I am facing issues where the dfm is unable to pick up or create features for punctuation such as "/" "-" "." or '. require(quanteda) dict = c('a','b','c','d','e','f','/',".",'-',"'") dict <-…
SuperSatya
2 votes, 1 answer

Change the length of ContextPre and ContextPost in Quanteda KWIC

Is there a way to increase the number of words appearing before and after the keyword in the quanteda kwic function? I've tried changing the numeric value in options(width = 200), but it didn't work. @KenBenoit
DebNa
2 votes, 1 answer

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes algorithm that predicts the category (i.e. business, entertainment) of an article based on type. I'm attempting this with Quanteda and have…
Matt
2 votes, 1 answer

R construct document term matrix how to match dictionaries whose values consist of white-space separated phrases

When doing text mining in R, after reprocessing text data, we need to create a document-term matrix for further exploration. But, similar to Chinese, English also has certain phrases, such as "semantic distance", "machine learning", if you…
Fiona_Wang
2 votes, 1 answer

How to stem all words in an ngram, using quanteda?

I'm working with the Quanteda package in R at the moment, and I'd like to calculate the ngrams of a set of stemmed words to get a quick-and-dirty estimate of what content words tend to be near each other. If I try: twitter.files <-…
2 votes, 2 answers

How to keep the beginning and end of sentence markers with quanteda

I'm trying to create 3-grams using R's quanteda package. I'm struggling to find a way to keep in the n-grams beginning and end of sentence markers, the and as in the code below. I thought that using the keptFeatures with a regular…
Giuseppe Romagnuolo
2 votes, 1 answer

Quanteda with topicmodels: removed stopwords appear in results (Chinese)

My code: library(quanteda) library(topicmodels) # Some raw text as a vector postText <- c("普京 称 俄罗斯 未 乌克兰 施压 来自 头 条 新闻", "长期 电脑 前进 食 致癌 环球网 报道 乌克兰 学者 认为 电脑 前进 食 会 引发 癌症 等 病症 电磁 辐射 作用 电脑 旁 水 食物 会 逐渐 变质 有害 物质 累积 尽管 人体 短期 内 会 感到 适 会 渐渐 引发 出 癌症 阿尔茨海默 式…
Jackson-MSFT
2 votes, 1 answer

Using dictionary to create Bigram in Quanteda

I am trying to remove typos from my data for text analysis, so I am using the dictionary feature of the quanteda package. It works fine for unigrams, but it gives unexpected output for bigrams. Not sure how to handle typos so that they do not sneak into my…
PeterV
2 votes, 1 answer

Form bigrams without stopwords in R

I have recently had some trouble with bigrams in text mining using R. The purpose is to find meaningful keywords in news, for example "smart car" and "data mining". Let's say I have a string as follows: "IBM have a great success in the…
John Chou
2 votes, 1 answer

Import lexisnexis output into R quanteda

I would like to use Benoit's R package quanteda to analyze articles exported from lexisnexis. The export is in the standard HTML format. I use the tm package + plugin to read the lexisnexis output. Unfortunately, an error occurs when transforming the…
bstn
2 votes, 2 answers

R tm Package: How to compare text to positive reference word list and return count of positive word occurrences

What is the best approach to use the tm library to compare text to a positive reference word list and return the count of positive word occurrences? I want to be able to return the sum of positive words in the reference text. Question: What is the best way to…
Technophobe01
1 vote, 1 answer

Backtransform word tokens to a sentence-based corpus in Quanteda after preprocessing

I want to preprocess my text data using the {quanteda} package in R. To do so, I am creating a corpus, which is then tokenized and preprocessed (e.g. lowercase, remove punctuation, etc.). Ideally, I would then want to restore the initial sentence…
Dr. Fabian Habersack
1 vote, 1 answer

[readtext]: download files from the Internet to remove text via stringi and read the file into Quanteda

My aim is to read multiple text files into Quanteda, first removing unwanted text that is contained within # marks. Stringi code has been provided to perform this task, however, problems were encountered reading the file in Quanteda, regarding the…
bgreen