Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopword removal or stemming, or by segmenting them into sentence or paragraph units.
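The corpus-to-tokens workflow above can be sketched as follows (a minimal example assuming the quanteda v3 API, where tokenization is an explicit step; the texts and variable names are invented for illustration):

```r
library(quanteda)

# Build a corpus from a named character vector; docvars() can attach
# document-level variables, and meta() collection-level metadata
txt <- c(doc1 = "Textual data can be analyzed quantitatively.",
         doc2 = "A corpus holds texts plus document-level variables.")
corp <- corpus(txt)

# Tokenize, then remove stopwords and stem
toks <- tokens(corp, remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  tokens_wordstem()

# Segment the corpus into sentence units
sents <- corpus_reshape(corp, to = "sentences")
```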

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.
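For example, because tokenization is delegated to stringi's ICU word-boundary rules, accented characters and non-ASCII input are handled without extra configuration (a small illustration; exact segmentation can depend on the installed ICU version):

```r
library(quanteda)

# UTF-8 input, including accented characters, is segmented
# using ICU word-boundary rules from the stringi package
toks <- tokens("Fährmann zählt Boote")
tokens_tolower(toks)
```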

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics using non-parametric bootstrapping applied to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
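The dictionary and collocation features can be sketched like this (assuming the quanteda v3 API; the dictionary categories here are invented for illustration):

```r
library(quanteda)

toks <- tokens(c("The United States raised interest rates.",
                 "Interest rates fell across the United States."))

# Declare a collocation so it is treated as a single feature
# (tokens_compound() joins the parts with "_" by default)
toks <- tokens_compound(toks, pattern = phrase("United States"))

# Map tokens to the categories of a hand-made dictionary
dict <- dictionary(list(economy = c("interest", "rates"),
                        geography = "United_States"))
dfm_dict <- dfm(tokens_lookup(toks, dictionary = dict))
```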

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
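For instance, a dfm can be built and summarized like this (a sketch using the inaugural-address corpus bundled with the package; assuming the quanteda v3 API):

```r
library(quanteda)

# Construct a dfm from the bundled corpus of US inaugural addresses
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))
mydfm <- dfm(toks)

# Describe: the most frequent features overall
topfeatures(mydfm, 10)

# Trim rare features before passing the matrix to a classifier
mydfm_trimmed <- dfm_trim(mydfm, min_termfreq = 5)
```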


627 questions
0
votes
1 answer

How to replace tokens (words) with stemmed versions of words from my own table?

I got data like this (simplified): library(quanteda) # sample data myText <- c("ala ma kotka", "kasia ma pieska") myDF <- data.frame(myText) myDF$myText <- as.character(myDF$myText) # tokenization tokens <- tokens(myDF$myText, what = "word", …
Garf
  • 75
  • 1
  • 12
0
votes
0 answers

Classifying texts at document and sentence level (using Quanteda and RTextTools)

I'm in the process of trying to figure out how to apply text classification using RTextTools on a corpus I downloaded from LexisNexis. I succeeded in both parsing the N LexisNexis html files into document-feature matrices using the Quanteda package…
0
votes
1 answer

Why does featnames(myDFM) contain features of more than one or two tokens?

I'm working with a large 1M doc corpus and have applied several transformations when creating a document frequency matrix from it: library(quanteda) corpus_dfm <- dfm(tokens(corpus1M), # where corpus1M is already a corpus via quanteda::corpus() …
Doug Fir
  • 19,971
  • 47
  • 169
  • 299
0
votes
1 answer

Display matching sentences by text typed in a Shiny app text box

I am trying to build a Shiny app that can dynamically display sentences from a database column by matching a corpus against text typed in a text box, i.e. as the user starts typing in the text box, all the sentences that would match (corpus from the text…
Vikram Karthic
  • 468
  • 4
  • 18
0
votes
1 answer

join quanteda dfm top ten 1grams with all dfm 2 thru 5grams

To conserve memory space when dealing with a very large corpus sample, I'm looking to take just the top 10 1grams and combine those with all of the 2 thru 5grams to form a single quanteda::dfmSparse object that will be used in natural language…
myusrn
  • 1,050
  • 2
  • 15
  • 29
0
votes
1 answer

Quanteda: how to plot lexical diversity as a function of time?

I have calculated lexical diversity for my DFM in Quanteda, and want to plot that over time. I have variables for year, month, and date in my corpus for each document as docvars. Is there some way to combine these data and produce a plot of lexical…
nasserq
  • 1
  • 3
0
votes
1 answer

TM, Quanteda, text2vec. Get strings on the left of term in wordlist according to regex pattern

I would like to analyse a big folder of texts for the presence of names, addresses, and telephone numbers in several languages. These will usually be preceded by a word such as "Address", "telephone number", "name", "company", "hospital", or "deliverer". I…
Jacek Kotowski
  • 620
  • 16
  • 49
0
votes
1 answer

KWIC into existing dataframe in R

I'd like to take the result of a quanteda function and add it to an existing spreadsheet. For example: newdf <- as.data.frame(kwic(x, keywords, window = 5, valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, ...)) This creates a…
Alex
  • 77
  • 1
  • 10
0
votes
2 answers

How to Cast a Dataframe into a DTM

I'd like to cast my table into a DTM and maintain the metadata. Each row should be a document. But in order to use cast_dtm(), there needs to be a count variable; to "cast", the data needs to be in the "Document, Term, Count" format. How…
Alex
  • 77
  • 1
  • 10
0
votes
1 answer

Feature extraction using Chi2 with Quanteda

I have a dataframe df with this structure: Rank Review 5 good film 8 very good film .. Then I tried to create a DocumentTermMatrix using the quanteda package: mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE) I would like…
dr.nasri84
  • 79
  • 2
  • 9
0
votes
1 answer

Document-Term Matrix with Quanteda

I have a dataframe df with this structure: Rank Review 5 good film 8 very good film .. Then I tried to create a DocumentTermMatrix using the quanteda package: temp.tf <- df$Review %>% tokens(ngrams = 1:1) %>% # generate tokens + dfm %>% #…
dr.nasri84
  • 79
  • 2
  • 9
0
votes
1 answer

Split up ngrams in document-feature matrix (quanteda)

I was wondering if it's possible to split up ngram features in a document-feature matrix (dfm) in such a way that e.g. a bigram results in two separate unigrams? head(dfm, n = 3, nfeature = 4) docs in_the great plenary emission_reduction …
uyanik
  • 63
  • 7
0
votes
1 answer

Compute chi square value between ngrams and documents with Quanteda

I use the Quanteda R package to extract ngrams (here 1grams and 2grams) from the text Data_clean$Review, but I am looking for a way in R to compute the chi-square between documents and the extracted ngrams. Here is the R code I used to clean up the text…
dr.nasri84
  • 79
  • 2
  • 9
0
votes
1 answer

Quanteda phrasetotoken does not work

Situation 1: I get strange results when applying the phrasetotoken function in the Quanteda package: dict <- dictionary(list(words = ......*lokale energie productie*......)) txt <- c("I like lokale energie producties") phrasetotoken(txt,…
pmkruyen
  • 142
  • 13
0
votes
0 answers

Random Forest using ngrams with R

I'm new to R, and I'm trying to do sentiment analysis of customer reviews using Random Forest. For this I would like to use ngrams (bigrams and trigrams) as features (I used the quanteda R package). Here is the R code: train <-…
dr.nasri84
  • 79
  • 2
  • 9