I have a list of articles and want to apply Stanza NLP to them, but when I run the code (in Google Colab), Stanza never finishes. On the project page (https://github.com/stanfordnlp/stanza) I read that separating the documents (in my case, the articles in a list) with double line breaks speeds up processing, but my code for that doesn't seem to work.
The code before trying to add the line breaks (import lines omitted):
file = open("texts.csv", mode="r", encoding='utf-8-sig')
data = list(csv.reader(file, delimiter=','))
file.close()  # with parentheses; a bare file.close does nothing
pickle.dump(data, open('List.p', 'wb'))

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma', use_gpu=True)

data_list = pickle.load(open('List.p', 'rb'))
new_list = []
for article in data_list:
    a = nlp(str(article))  ### this call runs forever and never finishes
    new_list.append(a)
pickle.dump(new_list, open('Annotated.p', 'wb'))
This code is followed by code for topic modeling. With a smaller dataset (327 KB) both the code above and the topic modeling ran with no trouble whatsoever, so the size of the csv file (3.37 MB) seems to be the problem...
So I tried the following lines of code:
data_split = '\n\n'.join(data)
This gives me the error "TypeError: sequence item 0: expected str instance, list found". So I tried:
data_split = '\n\n'.join(map(str, data))
That runs, but printing the first item (data_split[0]) gives me just "[" and nothing else.
I also played around with looping through the articles of the list data, creating a new list and appending to it, but that didn't work either.
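Given the "[" output, I suspect each entry of data is still a whole csv row (a list), so str() turns it into something like "['text']". What I think the join should look like is roughly this sketch, with row[0] standing in for whichever column actually holds the article text:

texts = [row[0] for row in data if row]   # pull the text column out of each csv row
data_split = '\n\n'.join(texts)           # one big string, articles separated by blank lines
doc = nlp(data_split)                     # a single pipeline call instead of one per article

Is that the right way to use the double-line-break trick?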
Are there maybe also other ways of speeding up Stanza on large datasets?
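One thing I did find in the Stanza documentation is that the pipeline can batch-process several documents at once if you pass it a list of stanza.Document objects. A sketch, again assuming texts is a list of plain strings:

in_docs = [stanza.Document([], text=t) for t in texts]   # wrap each article in a Document
out_docs = nlp(in_docs)                                  # process the whole batch in one call
print(out_docs[0].sentences[0].text)                     # results come back one Document per article

Would that be the better route here?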