Questions tagged [quanteda]

quanteda

The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. It also includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
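As a rough illustration of the corpus workflow described above, the following sketch builds a small corpus with one document-level variable, tokenizes it, removes stopwords, stems the tokens, and reshapes the corpus into sentences. The texts and the `source` variable are invented for this example, and the calls assume a recent quanteda release (version 3 or later).

```r
library(quanteda)

# Toy texts and a made-up document-level variable (illustrative only)
txts <- c(doc1 = "The cat sat on the mat. The dog barked.",
          doc2 = "Cats and dogs are running in the garden.")
corp <- corpus(txts, docvars = data.frame(source = c("a", "b")))

# Per-document overview of the corpus, including the docvars
summary(corp)

# Tokenize, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Remove English stopwords (matching is case-insensitive by default)
toks_nostop <- tokens_remove(toks, stopwords("en"))

# Stem the remaining tokens; stems are not always dictionary words
toks_stem <- tokens_wordstem(toks_nostop)

# Segment the corpus so that each sentence becomes its own document
corp_sent <- corpus_reshape(corp, to = "sentences")
ndoc(corp_sent)
```

Each step returns a new object, so the original corpus and its document variables stay available for later steps.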

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, which facilitates computing confidence intervals on textual statistics using non-parametric bootstrapping applied to the original texts as data. quanteda also includes a suite of tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
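The feature extraction described above might look like the following minimal sketch, which treats a multi-word expression as a single feature, builds a dfm, and then collapses features into dictionary categories. The texts and the dictionary keys (`fiscal`, `country`) are hypothetical and chosen only for illustration; the calls assume quanteda version 3 or later.

```r
library(quanteda)

# Toy texts (illustrative only)
txts <- c(d1 = "The United States raised taxes on income.",
          d2 = "Tax cuts and spending were debated in the United States.")
toks <- tokens(txts, remove_punct = TRUE)

# Join the multi-word expression into one token ("United_States")
toks <- tokens_compound(toks, pattern = phrase("United States"))

# Build the document-feature matrix; dfm() lowercases features by default
dfmat <- dfm(toks)
topfeatures(dfmat, 5)

# Hypothetical dictionary: glob patterns map features to categories
dict <- dictionary(list(fiscal  = c("tax*", "spending"),
                        country = "united_states"))

# Count dictionary matches per document and category
dfm_dict <- dfm_lookup(dfmat, dictionary = dict)
dfm_dict
```

The resulting `dfm_dict` has one column per dictionary key, which is often a convenient input for the descriptive or scaling methods mentioned above.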

627 questions

0 votes, 1 answer

r quanteda top features extraction returning modified words

I have tried using quanteda to extract top features, but the results were modified words, e.g. 'faulti' instead of 'faulty'. Is this the expected result? I have tried searching for the top features keywords in the original dataset…
Lenz

0 votes, 1 answer

Quanteda what does the variable Types mean that is returned by summary(corpus)?

I was studying the quanteda package for R and could not find in the documentation what the variable called Types, returned by summary(immig_corp), means. require(quanteda) require(readtext) Now I create the corpus: immig_corp <-…
BRCN

0 votes, 1 answer

Quanteda: how to get ngrams, and their frequences, given n-1 predecessor words/types

For next-word prediction using ngrams, I need to find all the ngrams (and their frequencies) given n-1 predecessor words. In the dfm I could not see any way to do that, so I started implementing it manually on textstat_frequency (data.frame). After…
user778806

0 votes, 1 answer

add docvars to dfm from separate data.frame r

After spending much time developing the proper corpus (e.g. stopwords, tf-idf), I created a dtm in the tm package and ran my topic model. I then proceeded to compare the topics to some document-level covariates of interest, only to learn that stm…
SeekingData

0 votes, 3 answers

R: How get file name with Quanteda: char_segment

I am using char_segment from the quanteda library to split one file into multiple documents separated by a pattern. This command works well and easily! (I did try str_match and strsplit, but without success.) Unfortunately, I am unable to get the…
Rodrigo B

0 votes, 1 answer

Remove a section from Corpus

I have a quanteda corpus of hundreds of documents. How do I remove specific sections, such as the abstract and footnotes? Otherwise, I am faced with doing it manually. Thanks. As requested, here is a text example. It is from a regular journal…

0 votes, 1 answer

How to convert new text data to a predefined dfm?

I am doing topic modeling with the package topicmodels, so I need to split the data into a training set and a test set. Is it possible to transform the test data into a predefined dfm object (generated from the training data)? Thanks
Bin H.

0 votes, 1 answer

obtaining textual data from a single column in dataframe

I want to read as text only one specific column of my dataframe, i.e. the 3rd column C, and create a word cloud. Let df= A B C 1 2 sheep 2 2 sheep 3 4 goat 4 5 camel 5 2 camel 6 1 camel I am trying to read the lines with readLines(df$C) but I get the…
Economist_Ayahuasca

0 votes, 1 answer

multiple co-occurence clusters on single term

I have a corpus in which a key term occurs at least once. From this I made an fcm that looks much like this. txts <- c("a a a b b c", "a a c e", "a c b e f g", "e d j b", "b g k l", "b a a g l", "e c b j k l", "b g w m") total <- fcm(txts, context =…
Mel Schickel

0 votes, 1 answer

Quanteda problems in R

I am using quanteda in R and have created the corpus and dfm. However, I notice that the dfm and corpus contain fewer documents than the original file. I would appreciate it if anyone could let me know why this happens and how to fix it. Thanks

0 votes, 3 answers

quanteda convert to topicmodels retaining docvars

I'm using the awesome quanteda package to convert my dfm to a topicmodels format. However, in the process I'm losing my docvars which I need for identifying which topics are most likely prevalent in my documents. This is especially a problem given…

0 votes, 1 answer

Text Similarity using PoS tag

I want to calculate text similarity using only the words with a specific POS tag. Currently I am calculating similarity using the cosine method, but it does not take POS tagging into account. A <- data.frame(name = c( "X-ray right leg arteries", …
john

0 votes, 1 answer

time series analysis of text in r

If I have some data like so: df = data.frame(person = c('jim','john','pam','jim'), date =c('2018-01-01','2018-02-01','2018-03-01','2018-04-01'), text = c('the lonely engineer','tax season is upon us, engineers, do…
Ted Mosby

0 votes, 0 answers

R: Is it possible to make package Quanteda and package Workspace talk to each other?

I would like to compute some distributional similarity on running text. There is a nice function in the quanteda package called fcm, which creates a co-occurrence matrix from text. For example: txt <- c("The quick brown fox jumped over the lazy…
Marina Santini

0 votes, 0 answers

Producing a Network Graph from Correlations Between Words in a Document

I am interested in creating a network graph similar to the one displayed on this person's website (the first one on this page): http://minimaxir.com/2016/12/interactive-network/ I would like to make the nodes of this graph == words in a .txt…
Davide Lorino