Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
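As a minimal sketch of the workflow described above, the following R snippet builds a small corpus with one document-level variable, tokenizes it, removes stopwords, stems the tokens, and re-segments the corpus by sentence. The two texts and the `year` variable are made up for illustration, and the calls assume a reasonably recent quanteda release.

```r
library(quanteda)

# Two toy documents (invented for this example) plus a document-level variable.
txts <- c(doc1 = "The committee is meeting today. Members are voting on faulty proposals.",
          doc2 = "Voters rejected the proposal. The vote was close.")
corp <- corpus(txts, docvars = data.frame(year = c(2017, 2018)))

# Tokenize without punctuation, drop English stopwords, then stem what remains.
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)

# Re-segment the corpus so that each sentence becomes its own document.
sents <- corpus_reshape(corp, to = "sentences")
ndoc(sents)
```

Note that stemming reduces words to stems, which are not necessarily dictionary words, so stemmed features can look different from the words in the original texts.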

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data. quanteda also includes a suite of tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm" for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
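A short sketch of the feature-extraction step might look like the following: it turns three toy texts into a dfm, lists the most frequent features, and then maps features onto dictionary keys. The texts and the dictionary keys (`fiscal`, `services`) are hypothetical, chosen only to illustrate the calls.

```r
library(quanteda)

# Three toy documents (invented for this example).
txts <- c(d1 = "taxes and spending rise",
          d2 = "spending on schools and hospitals",
          d3 = "taxes fall")

# Build the document-feature matrix from the tokenized texts.
dfmat <- dfm(tokens(txts))

# The most frequent features across all documents.
topfeatures(dfmat, n = 3)

# A hypothetical dictionary; values are glob patterns matched against features.
dict <- dictionary(list(fiscal   = c("tax*", "spending"),
                        services = c("school*", "hospital*")))

# Collapse the features into counts per dictionary key.
dfmat_dict <- dfm_lookup(dfmat, dict)
dfmat_dict
```

The resulting matrix has one column per dictionary key, which can then be passed to the descriptive, scaling, or classification methods mentioned above.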

Resources

CRAN page
Source code on GitHub (including the latest version in the dev branch)

627 questions

1 answer

r quanteda top features extraction returning modified words

I have tried using quanteda to extract top features, but the results were modified words, e.g. 'faulti' instead of 'faulty'. Is this the expected result? I have tried searching for the top features keywords in the original dataset…

r quanteda

asked Aug 21 '18 at 03:59

Lenz

1 answer

Quanteda what does the variable Types mean that is returned by summary(corpus)?

I was studying the quanteda package from R and I just could not find from the documents what the variable called Types that is returned by summary(immig_corp) means. require(quanteda) require(readtext) Now I create the corpus: immig_corp <-…

r quanteda

asked Aug 19 '18 at 13:41

BRCN

1 answer

Quanteda: how to get ngrams, and their frequencies, given n-1 predecessor words/types

For next word prediction using ngrams I would need to find all the ngrams (and their frequencies) given n-1 predecessor words. In dfm I could not see any way to do that, so I started implementing it manually on textstat_frequency (data.frame). After…

quanteda dfm

asked Aug 09 '18 at 10:52

user778806

1 answer

add docvars to dfm from separate data.frame r

After spending much time developing the proper corpus (e.g. stopwords, tf-idf) I created a dtm in the tm package and ran my topic model. I then proceeded to compare the topics to some document level covariates of interest, only to learn that stm…

r topic-modeling quanteda dfm

asked Jul 13 '18 at 15:10

SeekingData

3 answers

R: How to get file name with Quanteda: char_segment

I am using char_segment from the Quanteda library to separate multiple documents from one file, separated by a pattern. This command works great and easily! (I did try with str_match and strsplit but without success.) Unfortunately I am unable to get the…

r split text-mining quanteda

asked Jul 05 '18 at 01:10

Rodrigo B

1 answer

Remove a section from Corpus

I have a quanteda corpus of hundreds of documents. How do I remove specific sections, like the abstract and footnotes? Otherwise, I am faced with doing it manually. Thanks. As requested, here is a text example. It is from a regular journal…

r quanteda

asked Jun 21 '18 at 12:42

Nicholas Bradley

1 answer

How to convert new text data to a predefined dfm?

I am doing topic modeling with the package topicmodels, so I need to split the data into a train set and a test set. I wonder whether it is possible to transform the test data into a predefined dfm object (generated by the training data). Thanks

quanteda

asked Jun 12 '18 at 17:22

Bin H.

1 answer

obtaining textual data from a single column in dataframe

I want to read as text only one specific column of my dataframe, i.e. the 3rd column C, and create a word cloud. Let df= A B C 1 2 sheep 2 2 sheep 3 4 goat 4 5 camel 5 2 camel 6 1 camel I am trying to read the lines with readLines(df$C) but I get the…

r readline quanteda

asked Jun 12 '18 at 15:34

Economist_Ayahuasca

1 answer

multiple co-occurrence clusters on single term

I have a corpus in which a key term occurs at least once. From this I made an fcm that looks much like this. txts <- c("a a a b b c", "a a c e", "a c b e f g", "e d j b", "b g k l", "b a a g l", "e c b j k l", "b g w m") total <- fcm(txts, context =…

r nlp cluster-analysis quanteda

asked Jun 02 '18 at 09:19

Mel Schickel

1 answer

Quanteda problems in R

I am using Quanteda in R and have created the corpus and dfm. However, I notice that the dfm and corpus contain fewer documents than the original file. I would appreciate it if anyone could let me know why this happens and how to fix it. Thanks

r text-mining quanteda

asked May 30 '18 at 14:20

Nicholas Bradley

3 answers

quanteda convert to topicmodels retaining docvars

I'm using the awesome quanteda package to convert my dfm to a topicmodels format. However, in the process I'm losing my docvars which I need for identifying which topics are most likely prevalent in my documents. This is especially a problem given…

r quanteda topicmodels

asked May 29 '18 at 17:43

fritsvegters

1 answer

Text Similarity using PoS tag

I want to calculate text similarity using only the words with a specific POS tag. Currently I am calculating similarity using the cosine method, but it does not take POS tagging into account. A <- data.frame(name = c( "X-ray right leg arteries", …

r quanteda udpipe

asked May 16 '18 at 19:31

john

1 answer

time series analysis of text in r

If I have some data like so: df = data.frame(person = c('jim','john','pam','jim'), date =c('2018-01-01','2018-02-01','2018-03-01','2018-04-01'), text = c('the lonely engineer','tax season is upon us, engineers, do…

r quanteda

asked Apr 23 '18 at 23:44

Ted Mosby

0 answers

R: Is it possible to make package Quanteda and package Workspace talk to each other?

I would like to compute some distributional similarity on running text. There is a nice function in package Quanteda called fcm, which creates a co-occurrence matrix from text. For example: txt <- c("The quick brown fox jumped over the lazy…

r quanteda

asked Apr 21 '18 at 11:41

Marina Santini

0 answers

Producing a Network Graph from Correlations Between Words in a Document

I am interested in creating a network graph similar to the one displayed on this person's website (the first one on this page): http://minimaxir.com/2016/12/interactive-network/ I would like to make the nodes of this graph == words in a .txt…

r ggplot2 igraph quanteda ggnetwork

asked Apr 19 '18 at 16:44

Davide Lorino
