Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopword removal or stemming, or by segmenting them into sentence or paragraph units.
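
For example, a minimal sketch of that workflow using the current quanteda function names (the example texts and the document variable are invented for illustration; older releases expose the same steps under slightly different names, e.g. tokenize() rather than tokens()):

    library(quanteda)

    # Build a corpus with a document-level variable
    txts <- c(doc1 = "Quanteda handles text. It is fast.",
              doc2 = "Stemming and stopword removal are optional.")
    corp <- corpus(txts, docvars = data.frame(source = c("a", "b")))

    # Tokenize, optionally removing punctuation and stopwords, and stemming
    toks <- tokens(corp, remove_punct = TRUE)
    toks <- tokens_remove(toks, stopwords("en"))
    toks <- tokens_wordstem(toks)

    # Segment the corpus into sentence-level documents
    sent_corp <- corpus_reshape(corp, to = "sentences")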

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics through non-parametric bootstrapping applied to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
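
As a hedged sketch of that feature-extraction step, the snippet below declares a two-word collocation as a single feature and then counts tokens by dictionary key; the dictionary keys and phrases are invented for illustration:

    library(quanteda)

    toks <- tokens(c("The minister discussed fiscal policy.",
                     "Fiscal policy dominated the debate."),
                   remove_punct = TRUE)

    # Treat the collocation as one feature (joined with "_" by default)
    toks <- tokens_compound(toks, phrase("fiscal policy"))

    # Thesaurus-style lookup: map tokens onto dictionary keys
    dict <- dictionary(list(economy  = c("fiscal_policy", "budget*"),
                            politics = c("minister", "debate")))
    dfm_keys <- dfm(tokens_lookup(toks, dictionary = dict))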

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
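
A small sketch of that last step, assuming a dfm built as above (textstat_frequency() lives in the companion quanteda.textstats package in quanteda version 3 and later; in earlier releases it is in quanteda itself):

    library(quanteda)
    library(quanteda.textstats)

    dfmat <- dfm(tokens(c("one two two three", "three three four")))
    textstat_frequency(dfmat)   # feature frequencies across the documents

    # The same dfm can also feed a supervised classifier, e.g. Naive Bayes
    # from quanteda.textmodels:
    # mod <- textmodel_nb(dfmat_train, y = training_labels)
    # predict(mod, newdata = dfm_match(dfmat_test, featnames(dfmat_train)))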

627 questions
3 votes, 1 answer

How to create a quanteda corpus from a data.frame with multiple columns for text?

Let's say I have the following: x10 = data.frame(id = c(1,2,3),vars =c('top','down','top'), text1=c('this is text','so is this','and this is too.'), text2=c('we have more text here','and here too','and look at this, more text.')) I want…
Ted Mosby
3 votes, 1 answer

2 word phrase collocations using quanteda in R

This is regarding the textstat_collocations functionality in the quanteda package in R. I am getting more than two-word phrases in the output even though I am requesting only two-word phrases. The necessary processing steps are as follows (corpus1…
ds_newbie
3 votes, 0 answers

Is there a faster way to join/concatenate two tokens in R?

I am working with EMR data. Lots of entities within medical records are split into two different words (example - CT Scan) but I plan on joining these tokens to a single word by using an underscore (CT_Scan). Is there a faster way to perform this…
x1carbon
3 votes, 3 answers

Quanteda: Fastest way to replace tokens with lemma from dictionary?

Is there a much faster alternative to R quanteda::tokens_lookup()? I use tokens() in the 'quanteda' R package to tokenize a data frame with 2000 documents. Each document is 50 - 600 words. This takes a couple of seconds on my PC (Microsoft R Open…
Geir Inge
3 votes, 2 answers

Remove ngrams with leading and trailing stopwords

I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords. I have about 100 pdf files. I converted them to plain-text files through an Adobe batch…
syre
3 votes, 1 answer

Identify Nouns using Quanteda Corpuses

I am using the quanteda package by Ken Benoit and Paul Nulty to work with textual data. My corpus contains texts with full German sentences and I want to work with the nouns of every text only. One trick in German is to use the upper case words…
CFM
3 votes, 2 answers

Quanteda package, Naive Bayes: How can I predict on different-featured test data?

I used quanteda::textmodel_NB to create a model that categorizes text into one of two categories. I fit the model on a training data set of data from last summer. Now, I am trying to use it this summer to categorize new text we get here at work. I…
Mark White
3 votes, 1 answer

Create dfm step by step with quanteda

I want to analyze a big (n=500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one…
000andy8484
3 votes, 1 answer

Working with text classification and big sparse matrices in R

I'm working on a text multi-class classification project and I need to build the document / term matrices and train and test in R language. I already have datasets that don't fit in the limited dimensionality of the base matrix class in R and would…
Ed.
3 votes, 2 answers

Assigning weights to different features in R

Is it possible to assign weights to different features before formulating a DFM in R? Consider this example in R: str="apple is better than banana" mydfm=dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE) DFM mydfm looks like: docs…
Rahul Chawla
3 votes, 1 answer

R text mining how to segment document into phrases not terms

When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But as in Chinese, English also has certain phrases, such as "semantic distance", "machine learning", if you…
Fiona_Wang
2 votes, 2 answers

How to remove underscores from a text in Quanteda Tokens in R

EDIT See EDIT below I'm trying to convert a corpus object to tokens using R and Quanteda. Using the options in token() I cannot seem to remove the underscores in some words/characters. When I try using stri_replace_all_regex() the characters…
DartLazer
2 votes, 1 answer

How can I calculate cosine similarity between two sets of individual documents, using quanteda?

I have two sets of documents: One with approx. 580 news articles and one with approx. 560 political decisions. I want to find out whether there are similarities between the individual news articles and the political decisions. This means that each…
2 votes, 4 answers

In R, how to find the locations of all dictionary words, in a dataframe?

I'm analyzing corporate meetings, and I want to measure at what time people in the meetings bring up certain topics. Time meaning the location of the words. For example, in three meetings, when do people bring up "unionizing" and other words in my…
Kasi
2 votes, 1 answer

Tokenization of Compound Words not Working in Quanteda

I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately, I'm running into some error when attempting to tokenize the underlying dataset. This is the subset of the dataset I'm using as a…
kornpat