Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.


The quanteda package, written by Kenneth Benoit and Paul Nulty, provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. quanteda includes tools to make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them, with or without stopwords or stemming, or by segmenting them into sentence or paragraph units.
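A minimal sketch of that workflow, using the package's built-in data_char_ukimmig2010 texts (any character vector or corpus would work the same way):

    library(quanteda)

    corp <- corpus(data_char_ukimmig2010)               # build a corpus from a character vector
    toks <- tokens(corp, remove_punct = TRUE)           # tokenize
    toks <- tokens_remove(toks, stopwords("english"))   # drop stopwords
    toks <- tokens_wordstem(toks)                       # stem
    sents <- corpus_reshape(corp, to = "sentences")     # re-segment the corpus into sentence units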

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics by applying non-parametric bootstrapping to the original texts as data. quanteda also includes a suite of sophisticated tools to extract features of the texts into a quantitative matrix, where these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.
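A sketch of the dictionary and collocation side, with a small invented dictionary (the keys and patterns are illustrative only):

    # declare a collocation to be treated as a single feature
    toks <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
    toks <- tokens_compound(toks, pattern = phrase("european union"))

    # hypothetical dictionary; tokens_lookup() maps matching tokens onto the dictionary keys
    dict <- dictionary(list(economy = c("econom*", "tax*"),
                            europe  = c("european_union", "eu")))
    dfmat_dict <- dfm(tokens_lookup(toks, dictionary = dict))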

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine learning methods for class prediction.
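A sketch of the dfm stage (note that textstat_simil() ships with quanteda in older releases and with quanteda.textstats from version 3 on):

    dfmat <- dfm(tokens(data_char_ukimmig2010, remove_punct = TRUE))
    dfmat <- dfm_remove(dfmat, stopwords("english"))
    topfeatures(dfmat, 10)                    # describe: most frequent features
    # library(quanteda.textstats) is needed for the next call in quanteda >= 3
    textstat_simil(dfmat, method = "cosine")  # compare: document-by-document cosine similarity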


627 questions
2 votes, 1 answer

Keyword in context (kwic) for skipgrams?

I do keyword in context analysis with quanteda for ngrams and tokens and it works well. I now want to do it for skipgrams, to capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code creates a kwic object which is…
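For the skipgram question above, one partial workaround (the example sentence is invented): inside phrase(), the glob "*" matches exactly one token, so short gaps can be listed explicitly:

    library(quanteda)
    toks <- tokens("High barriers to entry and barriers to early entry were discussed.")
    kwic(toks,
         pattern = phrase(c("barriers to entry", "barriers to * entry")),
         window = 3)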
2 votes, 1 answer

Identify WHICH words in a document have been matched by dictionary lookup and how many times

Quanteda question. For each document in a corpus, I am trying to find out which of the words in a dictionary category contribute to the overall counts for that category, and how much. Put differently, I want to get a matrix of the features in each…
jules
  • 31
  • 3
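For the dictionary question above, a sketch using dfm_select(), which keeps only the features matching a dictionary, so the result shows which words contribute to the counts per document (subset the dictionary, e.g. dict["economy"], to inspect one key at a time); the dictionary here is invented:

    library(quanteda)
    dict <- dictionary(list(economy = c("tax*", "trade"), conflict = c("war", "defen*")))
    dfmat <- dfm(tokens(data_corpus_inaugural[1:5]))
    dfm_select(dfmat, pattern = dict)   # counts of the matching words, per document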
2 votes, 3 answers

how to extract ngrams from a text in R (newspaper articles)

I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm: dfmatrix <- dfm(corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE, remove_numbers = FALSE) I am trying to…
katwag97
  • 29
  • 3
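For the n-gram question above, a sketch that moves the preprocessing to the tokens stage (current quanteda style) and then builds bigrams; 'corpus' is the asker's object:

    library(quanteda)
    toks <- tokens(corpus, remove_punct = TRUE, remove_numbers = FALSE)
    toks <- tokens_remove(toks, stopwords("english"))
    toks <- tokens_wordstem(toks)
    dfmatrix_bigrams <- dfm(tokens_ngrams(toks, n = 2))   # bigram document-feature matrix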
2 votes, 1 answer

Quanteda: How to look up patterns of two or more words in a phrase, when there can be any number of words in between?

I want to match some patterns in a text in R using the package {quanteda} and the tokens_lookup() function with the default valuetype="glob". The pattern would be the occurrence of one word in connection with a second word located anywhere in the…
Dr. Fabian Habersack
  • 1,111
  • 12
  • 30
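For the pattern question above, a partial workaround under the assumption that a bounded gap is acceptable: in glob matching, "*" stands for exactly one token, so each allowed gap length is listed as its own pattern (the word pair and sentence are invented):

    library(quanteda)
    toks <- tokens("The climate summit produced a new climate and energy policy.")
    dict <- dictionary(list(climate_policy = c("climate policy",
                                               "climate * policy",
                                               "climate * * policy")))
    tokens_lookup(toks, dictionary = dict, valuetype = "glob")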
2 votes, 1 answer

Approximate string matching in R between two datasets

I have the following dataset containing film titles and the corresponding genre, while another dataset contains plain text where these titles might be quoted or not: dt1 title genre Secret in Their Eyes…
Carbo
  • 906
  • 5
  • 23
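For the title-matching question above, a rough sketch using base R's fuzzy matcher agrepl() rather than quanteda (the column names dt1$title and dt2$text are assumptions based on the excerpt):

    # TRUE where a title appears, approximately, in a text
    hits <- sapply(dt1$title, function(ttl)
      agrepl(ttl, dt2$text, max.distance = 0.1, ignore.case = TRUE))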
2 votes, 1 answer

Why does textstat_simil() with method "cosine" return NA

I am computing cosine similarity over two dfm objects. One is my reference object which has dimensions 5 x 4,728 while the second dfm is my target object and has dimensions 2,325,329 x 40,595. What I don't understand is why textstat_simil() returns…
Francesco Grossetti
  • 1,555
  • 9
  • 17
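For the cosine question above, a likely cause is all-zero document vectors (the cosine denominator is then a zero norm); a sketch that drops empty documents before computing the similarity ('dfmat' is a placeholder for either dfm):

    dfmat <- dfm_subset(dfmat, ntoken(dfmat) > 0)   # drop documents with no remaining features
    textstat_simil(dfmat, method = "cosine")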
2 votes, 1 answer

Searching for advanced regex patterns with kwic()

I want to use kwic() to find patterns in text with more advanced regex phrases, but I am struggling with the way kwic() is tokenising phrases and two problems arose: 1) How to use grouping with phrases that contain whitespace: kwic(text, pattern…
Sherls
  • 31
  • 3
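For the regex question above, a sketch of one relevant detail: quanteda applies a regex to one token at a time, so a multi-word pattern has to be split with phrase(), one regex per token (the example text is invented):

    library(quanteda)
    toks <- tokens("The new tax policy replaced the old tax code.")
    kwic(toks, pattern = phrase("ne\\w+ tax"), valuetype = "regex", window = 3)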
2 votes, 2 answers

Select phrases found in dictionary and return dataframe of doc_id and phrase

I have a dictionary file of medical phrases and a corpus of raw texts. I'm trying to use the dictionary file to select the relevant phrases from the text. Phrases, in this case, are 1 to 5-word n-grams. In the end, I would like the selected phrases…
Obed
  • 403
  • 3
  • 12
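For the phrase-extraction question above, a sketch built on the fact that kwic() accepts a dictionary as its pattern, and a kwic object converts to a data frame with the document name and the matched phrase ('toks' and 'dict' stand for the asker's tokens and dictionary):

    kw <- kwic(toks, pattern = dict)
    as.data.frame(kw)[, c("docname", "keyword")]   # doc_id and matched phrase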
2 votes, 1 answer

How to convert DFM into dataframe BUT keeping docvars?

I am using the quanteda package and the very good tutorials that have been written about it to perform various operations on newspaper articles. I obtained the frequency of specific words over time by selecting them in a mainwordsDFM and using…
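For the conversion question above, a sketch using convert() and then re-attaching the document variables ('mainwordsDFM' is the asker's object):

    df <- convert(mainwordsDFM, to = "data.frame")   # one row per document, one column per feature
    df <- cbind(df, docvars(mainwordsDFM))           # add the docvars back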
2 votes, 2 answers

Quanteda dfm_lookup using dictionaries with multi-word patterns/expressions

I am using a dictionary to identify usage of a particular set of words in a corpus. I have included multi-word patterns in the dictionary, however, I don't think dfm_lookup (from the quanteda package) matches multi-word expressions. Does anyone know…
MCC89
  • 57
  • 3
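For the lookup question above: once a dfm is built, each feature is a single word, so multi-word dictionary values can no longer be matched; a sketch that applies the dictionary at the tokens stage instead ('toks' and 'dict' stand for the asker's objects):

    dfmat <- dfm(tokens_lookup(toks, dictionary = dict))   # phrases are matched before the dfm is formed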
2 votes, 1 answer

Seeing metadata/docvars associated with STM topics

I am new to text analysis and am stuck on a question that doesn't seem to be answered in the documentation (or at least, I can't find it). I have created an STM in R from a Quanteda DfM which has docvars associated to it. The topics are based on…
Marina W.
  • 91
  • 2
  • 10
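For the STM question above, a sketch of one route: convert() with to = "stm" carries the docvars along as the $meta element, so they stay attached to the model input ('dfmat' stands for the asker's dfm; K = 10 is arbitrary):

    library(stm)
    stm_input <- convert(dfmat, to = "stm")
    mod <- stm(stm_input$documents, stm_input$vocab, K = 10, data = stm_input$meta)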
2 votes, 1 answer

textstat_keyness for POS, not words

textstat_keyness in Quanteda is used to compare the relative frequency of WORDS/LEMMAS in two (sub)corpora. But I want to compare parts of speech--not words. Then I want to plot it. I've been able to use textstat_keyness for words, no problem, using…
dfayers
  • 35
  • 4
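For the POS question above, a sketch under the assumption that spacyr is available for tagging: replace each word with its POS tag, rebuild tokens from the tags, and run textstat_keyness() on the resulting dfm ('corp' and 'target_docs' are placeholders):

    library(quanteda)
    library(spacyr)                                          # needs a working spaCy backend

    parsed <- spacy_parse(corp, pos = TRUE)
    pos_toks <- as.tokens(split(parsed$pos, parsed$doc_id))  # one stream of POS tags per document
    dfmat_pos <- dfm(pos_toks)
    textstat_keyness(dfmat_pos, target = docnames(dfmat_pos) %in% target_docs)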
2 votes, 1 answer

Why is Quanteda not removing words?

I am having trouble removing profanities from my n-grams. The getProfanityWords function below correctly creates a character vector. The whole script works in every other way, but the profanities remain. I did wonder whether it was to do with the…
Chris
  • 1,449
  • 1
  • 18
  • 39
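For the removal question above, a sketch of the usual fix: remove the profanity list at the tokens stage, before the n-grams are formed, and match case-insensitively ('toks' stands for the asker's tokens object; getProfanityWords() is the function from the question):

    toks <- tokens_remove(toks, pattern = getProfanityWords(), case_insensitive = TRUE)
    ngrams <- tokens_ngrams(toks, n = 2:3)   # build the n-grams only after the removal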
2 votes, 1 answer

Interpretation of dfm_weight(scheme='prop') with groups (quanteda)

I'm looking at the different weighting options using the dfm_weight. If I select scheme = 'prop' and I group textstat_frequency by location, what's the proper interpretation of a word in each group? Say in New York the term career is 0.6 and in…
Ted Mosby
  • 1,426
  • 1
  • 16
  • 41
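For the weighting question above, a sketch that makes the interpretation explicit: with scheme = "prop" each count becomes a share of its own document; for shares within a location, group the counts first and then weight, so each location's row sums to 1 ('dfmat' and the 'location' docvar are assumptions):

    dfmat_loc <- dfm_group(dfmat, groups = docvars(dfmat, "location"))
    dfm_weight(dfmat_loc, scheme = "prop")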
2 votes, 1 answer

Quanteda R: How to remove numbers or symbols "from"/"in" a token?

I have a question regarding the language pre-processing in quanteda R. I want to generate a document-feature matrix based on some documents. So, I generated a corpus and ran the following code. data <- read.csv2("abstract.csv", stringsAsFactors =…
Hu_Ca
  • 47
  • 1
  • 5
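For the preprocessing question above, a sketch: remove_numbers and remove_symbols only drop tokens that consist entirely of digits or symbols, so stripping digits inside a token needs the token types edited directly (the 'abstract' column name and the desired cleaned form are assumptions):

    toks <- tokens(data$abstract, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
    toks <- tokens_replace(toks, types(toks),
                           gsub("[0-9]+", "", types(toks)), valuetype = "fixed")
    toks <- tokens_select(toks, pattern = "*", min_nchar = 1)   # drop any types that became empty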