Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopword removal or stemming, or by segmenting them into sentence or paragraph units.
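The corpus-to-tokens workflow above can be sketched as follows (a minimal example assuming the quanteda v3 API, where tokenization is an explicit step; the texts and variable names are invented for illustration):

```r
library(quanteda)

# Build a corpus from a named character vector; docvars() can attach
# document-level variables, and meta() collection-level metadata
txt <- c(doc1 = "Textual data can be analyzed quantitatively.",
         doc2 = "A corpus holds texts plus document-level variables.")
corp <- corpus(txt)

# Tokenize, then remove stopwords and stem
toks <- tokens(corp, remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  tokens_wordstem()

# Segment the corpus into sentence units
sents <- corpus_reshape(corp, to = "sentences")
```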

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.
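For example, because tokenization is delegated to stringi's ICU word-boundary rules, accented characters and non-ASCII input are handled without extra configuration (a small illustration; exact segmentation can depend on the installed ICU version):

```r
library(quanteda)

# UTF-8 input, including accented characters, is segmented
# using ICU word-boundary rules from the stringi package
toks <- tokens("Fährmann zählt Boote")
tokens_tolower(toks)
```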

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics using non-parametric bootstrapping applied to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
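The dictionary and collocation features can be sketched like this (assuming the quanteda v3 API; the dictionary categories here are invented for illustration):

```r
library(quanteda)

toks <- tokens(c("The United States raised interest rates.",
                 "Interest rates fell across the United States."))

# Declare a collocation so it is treated as a single feature
# (tokens_compound() joins the parts with "_" by default)
toks <- tokens_compound(toks, pattern = phrase("United States"))

# Map tokens to the categories of a hand-made dictionary
dict <- dictionary(list(economy = c("interest", "rates"),
                        geography = "United_States"))
dfm_dict <- dfm(tokens_lookup(toks, dictionary = dict))
```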

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
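For instance, a dfm can be built and summarized like this (a sketch using the inaugural-address corpus bundled with the package; assuming the quanteda v3 API):

```r
library(quanteda)

# Construct a dfm from the bundled corpus of US inaugural addresses
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))
mydfm <- dfm(toks)

# Describe: the most frequent features overall
topfeatures(mydfm, 10)

# Trim rare features before passing the matrix to a classifier
mydfm_trimmed <- dfm_trim(mydfm, min_termfreq = 5)
```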


627 questions
0
votes
1 answer

How to replace tokens (words) with stemmed versions of words from my own table?

I got data like this (simplified): library(quanteda) # sample data myText <- c("ala ma kotka", "kasia ma pieska") myDF <- data.frame(myText) myDF$myText <- as.character(myDF$myText) # tokenization tokens <- tokens(myDF$myText, what = "word", …
Garf
  • 75
  • 1
  • 12
0
votes
0 answers

Classifying texts at document and sentence level (using Quanteda and RTextTools)

I'm in the process of trying to figure out how to apply text classification using RTextTools on a corpus I downloaded from LexisNexis. I succeeded in both parsing the N LexisNexis html files into document-feature matrices using the Quanteda package…
0
votes
1 answer

Why does featnames(myDFM) contain features of more than one or two tokens?

I'm working with a large 1M doc corpus and have applied several transformations when creating a document frequency matrix from it: library(quanteda) corpus_dfm <- dfm(tokens(corpus1M), # where corpus1M is already a corpus via quanteda::corpus() …
Doug Fir
  • 19,971
  • 47
  • 169
  • 299
0
votes
1 answer

Display matching sentences by text typed in a Shiny app text box

I am trying to build a Shiny app that can dynamically display sentences from a database column by matching a corpus against text typed in a text box, i.e. as the user starts typing in the text box, all the sentences that would match (corpus from the text…
Vikram Karthic
  • 468
  • 4
  • 18
0
votes
1 answer

join quanteda dfm top ten 1grams with all dfm 2 thru 5grams

To conserve memory space when dealing with a very large corpus sample, I'm looking to take just the top 10 1grams and combine those with all of the 2 thru 5grams to form a single quanteda::dfmSparse object that will be used in natural language…
myusrn
  • 1,050
  • 2
  • 15
  • 29
0
votes
1 answer

Quanteda: how to plot lexical diversity as a function of time?

I have calculated lexical diversity for my DFM in Quanteda, and want to plot that over time. I have variables for year, month, and date in my corpus for each document as docvars. Is there some way to combine these data and produce a plot of lexical…
nasserq
  • 1
  • 3
0
votes
1 answer

TM, Quanteda, text2vec. Get strings on the left of term in wordlist according to regex pattern

I would like to analyse a big folder of texts for the presence of names, addresses, and telephone numbers in several languages. These will usually be preceded by a word such as "Address", "telephone number", "name", "company", "hospital", or "deliverer". I…
Jacek Kotowski
  • 620
  • 16
  • 49
0
votes
1 answer

KWIC into existing dataframe in R

I'd like to take the result of a quanteda function and add it to an existing spreadsheet. For example: newdf <- as.data.frame(kwic(x, keywords, window = 5, valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, ...)) This creates a…
Alex
  • 77
  • 1
  • 10
0
votes
2 answers

How to Cast a Dataframe into a DTM

I'd like to cast my table into a DTM and maintain the metadata. Each row should be a document. But in order to use cast_dtm(), there needs to be a count variable; to "cast", the data needs to be in the "Document, Term, Count" format. How…
Alex
  • 77
  • 1
  • 10
0
votes
1 answer

Feature extraction using Chi2 with Quanteda

I have a dataframe df with this structure: Rank Review 5 good film 8 very good film .. Then I tried to create a DocumentTermMatrix using the quanteda package: mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE) I would like…
dr.nasri84
  • 79
  • 2
  • 9
0
votes
1 answer

Document-Term Matrix with Quanteda

I have a dataframe df with this structure: Rank Review 5 good film 8 very good film .. Then I tried to create a DocumentTermMatrix using the quanteda package: temp.tf <- df$Review %>% tokens(ngrams = 1:1) %>% # generate tokens + dfm %>% #…
dr.nasri84
  • 79
  • 2
  • 9
0
votes
1 answer

Split up ngrams in document-feature matrix (quanteda)

I was wondering if it's possible to split up ngram features in a document-feature matrix (dfm) in such a way that e.g. a bigram results in two separate unigrams? head(dfm, n = 3, nfeature = 4) docs in_the great plenary emission_reduction …
uyanik
  • 63
  • 7
0
votes
1 answer

Compute chi square value between ngrams and documents with Quanteda

I use the Quanteda R package to extract ngrams (here 1grams and 2grams) from the text Data_clean$Review, but I am looking for a way in R to compute the chi-square between documents and the extracted ngrams. Here is the R code I used to clean up the text…
dr.nasri84
  • 79
  • 2
  • 9
0
votes
1 answer

Quanteda phrasetotoken does not work

Situation 1: I get strange results when applying the phrasetotoken function in the Quanteda package: dict <- dictionary(list(words = ......*lokale energie productie*......)) txt <- c("I like lokale energie producties") phrasetotoken(txt,…
pmkruyen
  • 142
  • 13
0
votes
0 answers

Random Forest using ngrams with R

I'm new to R, and I'm trying to do sentiment analysis of customer reviews using Random Forest. For this I would like to use ngrams (bigrams and trigrams) as features (I used the quanteda R package). Here is the R code: train <-…
dr.nasri84
  • 79
  • 2
  • 9