
I have a list of articles and want to run Stanza NLP on them, but when I run the code (in Google Colab), Stanza never finishes. In the Stanza README (https://github.com/stanfordnlp/stanza) I found that separating the documents (in my case, the articles in my list) with double line breaks speeds up processing, but my code for that doesn't seem to work.
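
As I understand that tip, the whole list should be joined into one big string, with a blank line between articles, and passed to the pipeline once. A minimal sketch of what I take that to mean, with placeholder strings standing in for my articles:

import stanza

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma', use_gpu=True)

texts = ["First article text.", "Second article text."]  # placeholders for my articles
doc = nlp('\n\n'.join(texts))  # blank lines mark the document boundaries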

Code before trying to add line breaks:

import csv
import pickle
import stanza

file = open("texts.csv", mode="r", encoding='utf-8-sig')
data = list(csv.reader(file, delimiter=','))  # each row comes back as a list
file.close()

pickle.dump(data, open('List.p', 'wb'))

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma', use_gpu=True)

data_list = pickle.load(open('List.p', 'rb'))
new_list = []

for article in data_list:
    a = nlp(str(article))  # one pipeline call per article
    new_list.append(a)  ### here the code runs forever and doesn't finish
pickle.dump(new_list, open('Annotated.p', 'wb'))

This code is followed by code for topic modeling. I tried the code above plus the topic modeling code on a smaller dataset (327 KB) and had no trouble whatsoever, but the size of this csv file (3.37 MB) seems to be a problem...

So I tried the following lines of code:

data_split = '\n\n'.join(data)

This gives me the error "TypeError: sequence item 0: expected str instance, list found", presumably because csv.reader returns each row as a list rather than a string. So I then tried:

data_split = '\n\n'.join(map(str, data))

That runs, but data_split is now one long string, and printing its first character (data_split[0]) gives me "[" and nothing else.
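
I think that is because str() keeps the list brackets of each csv row, e.g. (assuming each row is a one-element list):

row = ['full text of the first article']
print(str(row))     # ['full text of the first article'] - the brackets come from the list
print(str(row)[0])  # [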

I also played around with looping through the articles in data, appending them to a new list and joining that, but that didn't work either.
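
Something along these lines, assuming the article text sits in the first column of each row:

texts = []
for row in data:
    texts.append(row[0])  # take the text itself instead of str(row)
data_split = '\n\n'.join(texts)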

Maybe there are also other ways of speeding up Stanza when working with large datasets?
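
For reference, the Stanza documentation also describes wrapping each text in a stanza.Document and passing the whole list to the pipeline, which lets it batch across documents (this needs a reasonably recent Stanza version). A minimal sketch, again with placeholder strings:

import stanza

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma', use_gpu=True)

texts = ["First article text.", "Second article text."]  # placeholders for the real articles
in_docs = [stanza.Document([], text=t) for t in texts]   # wrap each article in a Document
out_docs = nlp(in_docs)  # one batched call; returns a list of annotated Documents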

