
I want to create an empty corpus in textacy and later on fill it up with data via

corpus.add(doc)

But every time I try to create an empty corpus, I cannot save it and get this error instead:

IndexError: list index out of range

I tried both omitting the data argument and passing None as data when creating the corpus:

corpus = textacy.Corpus(lang=locale)
corpus = textacy.Corpus(lang=locale, data=None)
corpus.save(path) # this line results in the index error

It would be nice if anybody could help me :)

aAnnAa

1 Answer


I have just tried this out myself. What exactly is locale? I did the following:

  1. created a spaCy language object for German with

nlp = spacy.load("de_core_news_lg")

  2. and then passed it to

corpus = textacy.Corpus(nlp)

After that I was able to iterate through my documents and add them to the corpus one by one.

However, I would not recommend doing this. I compared two approaches to processing 15k short comments:

  • I first preprocessed my documents into a list and passed it directly to textacy.Corpus(nlp, data=preprocessed_list). That took around 22 s.
  • Running the same logic, but creating an empty corpus and adding the items one at a time, took 1 min 26 s.
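
A rough sketch of the first approach, combined with a guard so that an empty corpus is never saved (the question reports that saving one raises the IndexError). The helper names collect_texts and build_and_save are made up for illustration and are not part of textacy. The sketch only assumes that the corpus object supports len() and has a save(path) method, as used in the question:

```python
from typing import Callable, Iterable, List


def collect_texts(raw_comments: Iterable[str]) -> List[str]:
    """Stand-in for real preprocessing: strip whitespace, drop empty comments."""
    cleaned = (comment.strip() for comment in raw_comments)
    return [comment for comment in cleaned if comment]


def build_and_save(make_corpus: Callable[[List[str]], object],
                   texts: List[str],
                   path: str) -> bool:
    """Build the corpus in one call and save it only if it holds documents.

    `make_corpus` is a hypothetical factory, for example
    lambda data: textacy.Corpus(nlp, data=data)
    Returns True if the corpus was saved, False if there was nothing to save.
    """
    if not texts:
        # Nothing to add, so do not create (or save) an empty corpus at all.
        return False
    corpus = make_corpus(texts)
    if len(corpus) == 0:
        # Assumption: len(corpus) reports the number of documents it holds.
        return False
    corpus.save(path)
    return True
```

With textacy this could be called as build_and_save(lambda data: textacy.Corpus(nlp, data=data), collect_texts(comments), path), where comments is your own list of raw strings.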
chAlexey