
I would like to know whether it is possible to remove documents from a corpus when their text is in fact "empty". I am building a corpus of texts in order to subsequently fit some text models using the quanteda package in R. The texts are in a column of a csv file and are imported as follows:

> mycorpus <- corpus(readtext("tablewithdocuments.csv", text_field = "textcolumn"))
> mycorpus
Corpus consisting of 25 documents and 14 docvars.

I know how to erase empty texts from the dfm of the corpus, but I want to have a new corpus which is a subset of the original one excluding documents with a missing cell in the csv column "textcolumn".

In practice, starting from a corpus such as the following:

library("quanteda")

text <- c(
  doc1 = "",
  doc2 = "pinapples and pizzas taste good",
  doc3 = "but please do not mix them together"
)
mycorpus <- corpus(text)

mycorpus
## Corpus consisting of 3 documents and 0 docvars.

summary(mycorpus)
## Corpus consisting of 3 documents:
## Text Types Tokens Sentences
## doc1     0      0         0
## doc2     4      4         1
## doc3     5      5         1

I would like to obtain a new corpus with only doc2 and doc3 in it.

Thank you in advance for your help.

Best wishes,

Michele

  • See `?corpus_subset`. You would want something like `corpus_subset(mycorpus, textcolumn = "")`. But impossible to answer without a more reproducible example and a better explanation of the nature of your docvars and expected output. – Ken Benoit Aug 09 '19 at 20:04
  • Dear Professor Benoit, thanks for your prompt answer and sorry for not being clear enough. Your suggestion addressed my first point about selecting texts according to specific docvars, but I think the major point of my question remains unanswered (my fault for not being clear). I re-edited the question above and I hope it is now much clearer what I want to do. As you can see, it is quite a straightforward thing. Thanks for your help! – Michele Scotto Aug 09 '19 at 21:32
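
One possible way to do this, building on the `corpus_subset()` suggestion in the comments, is to subset on the number of tokens per document rather than on a docvar. The following is only a minimal sketch under that assumption: it reuses the toy corpus from the question, and the names `n_tok` and `nonempty` are made up for illustration. It has not been checked against the real csv data, where an empty cell might be imported differently (for instance as `NA` rather than `""`).

```r
library("quanteda")

# Toy corpus from the question: doc1 has an empty text
text <- c(
  doc1 = "",
  doc2 = "pinapples and pizzas taste good",
  doc3 = "but please do not mix them together"
)
mycorpus <- corpus(text)

# Number of tokens per document; an empty text yields zero tokens
n_tok <- ntoken(tokens(mycorpus))

# Keep only the documents that contain at least one token
nonempty <- corpus_subset(mycorpus, n_tok > 0)

docnames(nonempty)
```

On the toy example this should leave only doc2 and doc3 in `nonempty`, with any docvars carried over by `corpus_subset()`. Counting tokens instead of comparing the text to `""` also treats whitespace-only texts as empty, which may or may not be what is wanted.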
