I am trying to analyze news snippets in order to identify crisis periods. To do so, I have already downloaded news articles from the past 7 years and have them available. I am now applying an LDA (Latent Dirichlet Allocation) model to this dataset in order to identify which countries show signs of an economic crisis.
I am basing my code on a blog post by Jordan Barber (https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) – here is my code so far:
import csv

# read the CSV file into a list with one text block per row
rows = []
with open('Testfile.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        rows.append(row)

# create doc_set: the first column of each row holds the article text
doc_set = []
for row in rows:
    doc_set.append(row[0])
# imports - gensim and stop_words need to be installed manually on a fresh Python install
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# list for tokenized documents in loop
texts = []

# loop through document list
for doc in doc_set:
    # clean and tokenize document string
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [t for t in tokens if t not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)
print(ldamodel.print_topics(num_topics=5, num_words=5))
# map topics to documents: each row of doc_lda is a list of (topic_id, probability) tuples
doc_lda = ldamodel[corpus]
with open('doc_lda.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in doc_lda:
        writer.writerow(row)
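One wrinkle I noticed: ldamodel[corpus] silently drops topics below a small probability threshold, so the rows written to doc_lda.csv have varying lengths. To get one fixed-length probability vector per article (which I use in the sketches below), I currently do the following – the variable name doc_topic_matrix is mine, and I am not sure this is the canonical way:

import numpy as np

# dense document-topic matrix: one row per article, one column per topic
num_topics = 5
doc_topic_matrix = np.zeros((len(corpus), num_topics))
for doc_idx, bow in enumerate(corpus):
    # minimum_probability=0.0 keeps every topic, even near-zero ones
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0.0):
        doc_topic_matrix[doc_idx, topic_id] = prob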
Essentially, I identify a number of topics (5 in the code above – still to be tuned), and the last block assigns each news article a set of scores indicating the probability that the article relates to each of these topics. At the moment I can only make a manual, qualitative assessment of whether a given topic is crisis-related, which is a bit unfortunate. What I would much rather do is tell the algorithm whether an article was published during a crisis and use that additional piece of information to identify topics both for my “crisis years” and for my “non-crisis years”. Simply splitting the dataset and fitting topics only on my “bads” (i.e. crisis years) won’t work in my opinion, as I would still need to manually select which topics are actually crisis-related and which would show up anyway (sports news, …).
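The closest workaround I can think of is a two-step approach: fit LDA unsupervised exactly as above, then use the topic proportions as features in a plain classifier and let the learned weights tell me which topics line up with crisis periods. A minimal sketch, assuming scikit-learn is installed, doc_topic_matrix from the snippet above, and a hypothetical list labels (1 = article published during a crisis period, 0 = otherwise) that I would build from my own dating of the crises:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# labels is hypothetical: 1 = crisis-period article, 0 = non-crisis
X_train, X_test, y_train, y_test = train_test_split(
    doc_topic_matrix, labels, test_size=0.2, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # held-out accuracy
print(clf.coef_)                   # strongly positive weights flag "crisis" topics

This is not a proper supervised topic model (supervised LDA / Labeled LDA), which as far as I can tell gensim does not offer, so the topics themselves are still learned without the labels.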
So, is there a way to adapt the code to a) incorporate the “crisis” vs. “non-crisis” information directly and b) automatically choose the optimal number of topics / words to maximize the model’s predictive power?
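For b), the best I have come up with so far is a sweep over candidate topic counts, scoring each model with gensim’s CoherenceModel and keeping the best one – though I am aware that coherence measures topic interpretability rather than predictive power, so this may only be a proxy:

from gensim.models import CoherenceModel

best_k, best_score = None, -1.0
for k in range(2, 16):
    candidate = gensim.models.ldamodel.LdaModel(
        corpus, num_topics=k, id2word=dictionary, passes=10)
    cm = CoherenceModel(model=candidate, texts=texts,
                        dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)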
Thanks a lot in advance!