Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopword removal or stemming, or by segmenting them into sentence or paragraph units.
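
For example, a minimal sketch of that workflow using the current quanteda function names (the example texts and the document variable are invented for illustration; older releases expose the same steps under slightly different names, e.g. tokenize() rather than tokens()):

    library(quanteda)

    # Build a corpus with a document-level variable
    txts <- c(doc1 = "Quanteda handles text. It is fast.",
              doc2 = "Stemming and stopword removal are optional.")
    corp <- corpus(txts, docvars = data.frame(source = c("a", "b")))

    # Tokenize, optionally removing punctuation and stopwords, and stemming
    toks <- tokens(corp, remove_punct = TRUE)
    toks <- tokens_remove(toks, stopwords("en"))
    toks <- tokens_wordstem(toks)

    # Segment the corpus into sentence-level documents
    sent_corp <- corpus_reshape(corp, to = "sentences")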

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics through non-parametric bootstrapping applied to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
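
As a hedged sketch of that feature-extraction step, the snippet below declares a two-word collocation as a single feature and then counts tokens by dictionary key; the dictionary keys and phrases are invented for illustration:

    library(quanteda)

    toks <- tokens(c("The minister discussed fiscal policy.",
                     "Fiscal policy dominated the debate."),
                   remove_punct = TRUE)

    # Treat the collocation as one feature (joined with "_" by default)
    toks <- tokens_compound(toks, phrase("fiscal policy"))

    # Thesaurus-style lookup: map tokens onto dictionary keys
    dict <- dictionary(list(economy  = c("fiscal_policy", "budget*"),
                            politics = c("minister", "debate")))
    dfm_keys <- dfm(tokens_lookup(toks, dictionary = dict))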

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
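
A small sketch of that last step, assuming a dfm built as above (textstat_frequency() lives in the companion quanteda.textstats package in quanteda version 3 and later; in earlier releases it is in quanteda itself):

    library(quanteda)
    library(quanteda.textstats)

    dfmat <- dfm(tokens(c("one two two three", "three three four")))
    textstat_frequency(dfmat)   # feature frequencies across the documents

    # The same dfm can also feed a supervised classifier, e.g. Naive Bayes
    # from quanteda.textmodels:
    # mod <- textmodel_nb(dfmat_train, y = training_labels)
    # predict(mod, newdata = dfm_match(dfmat_test, featnames(dfmat_train)))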

627 questions
3 votes, 1 answer

How to create a quanteda corpus from a data.frame with multiple columns for text?

Let's say I have the following: x10 = data.frame(id = c(1,2,3),vars =c('top','down','top'), text1=c('this is text','so is this','and this is too.'), text2=c('we have more text here','and here too','and look at this, more text.')) I want…
Ted Mosby
3 votes, 1 answer

2 word phrase collocations using quanteda in R

This is regarding the textstat_collocations functionality in the quanteda package in R. I am getting more than two-word phrases in the output even though I am requesting only two-word phrases. The necessary processing steps are as follows (corpus1…
ds_newbie
3 votes, 0 answers

Is there a faster way to join/concatenate two tokens in R?

I am working with EMR data. Lots of entities within medical records are split into two different words (example - CT Scan) but I plan on joining these tokens to a single word by using an underscore (CT_Scan). Is there a faster way to perform this…
x1carbon
3 votes, 3 answers

Quanteda: Fastest way to replace tokens with lemma from dictionary?

Is there a much faster alternative to R quanteda::tokens_lookup()? I use tokens() in the 'quanteda' R package to tokenize a data frame with 2000 documents. Each document is 50 - 600 words. This takes a couple of seconds on my PC (Microsoft R Open…
Geir Inge
3 votes, 2 answers

Remove ngrams with leading and trailing stopwords

I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords. I have about 100 pdf files. I converted them to plain-text files through an Adobe batch…
syre
3 votes, 1 answer

Identify Nouns using Quanteda Corpuses

I am using the quanteda package by Ken Benoit and Paul Nulty to work with textual data. My corpus contains texts with full German sentences and I want to work with the nouns of every text only. One trick in German is to use the upper case words…
CFM
3 votes, 2 answers

Quanteda package, Naive Bayes: How can I predict on different-featured test data?

I used quanteda::textmodel_NB to create a model that categorizes text into one of two categories. I fit the model on a training data set of data from last summer. Now, I am trying to use it this summer to categorize new text we get here at work. I…
Mark White
3 votes, 1 answer

Create dfm step by step with quanteda

I want to analyze a big (n=500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one…
000andy8484
3 votes, 1 answer

Working with text classification and big sparse matrices in R

I'm working on a text multi-class classification project and I need to build the document / term matrices and train and test in R language. I already have datasets that don't fit in the limited dimensionality of the base matrix class in R and would…
Ed.
3 votes, 2 answers

Assigning weights to different features in R

Is it possible to assign weights to different features before formulating a DFM in R? Consider this example in R: str="apple is better than banana" mydfm=dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE) DFM mydfm looks like: docs…
Rahul Chawla
3 votes, 1 answer

R text mining how to segment document into phrases not terms

When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But as in Chinese, English also has certain phrases, such as "semantic distance", "machine learning", if you…
Fiona_Wang
2 votes, 2 answers

How to remove underscores from a text in Quanteda Tokens in R

EDIT See EDIT below I'm trying to convert a corpus object to tokens using R and Quanteda. Using the options in token() I cannot seem to remove the underscores in some words/characters. When I try using stri_replace_all_regex() the characters…
DartLazer
2 votes, 1 answer

How can I calculate cosine similarity between two sets of individual documents, using quanteda?

I have two sets of documents: One with approx. 580 news articles and one with approx. 560 political decisions. I want to find out whether there are similarities between the individual news articles and the political decisions. This means that each…
2 votes, 4 answers

In R, how to find the locations of all dictionary words, in a dataframe?

I'm analyzing corporate meetings, and I want to measure at what time people in the meetings bring up certain topics. Time meaning the location of the words. For example, in three meetings, when do people bring up "unionizing" and other words in my…
Kasi
2 votes, 1 answer

Tokenization of Compound Words not Working in Quanteda

I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately, I'm running into some error when attempting to tokenize the underlying dataset. This is the subset of the dataset I'm using as a…
kornpat