Stipulation of "Good"/"Bad"-Cases in an LDA Model (Using gensim in Python)

Question

I am trying to analyze news snippets in order to identify crisis periods. To do so, I have already downloaded news articles over the past 7 years and have those available. Now, I am applying a LDA (Latent Dirichlet Allocation) model on this dataset in order to identify those countries show signs of an economic crisis.

I am basing my code on a blog post by Jordan Barber (https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) – here is my code so far:

import os, csv

#create list with text blocks in rows, based on csv file
list=[]

with open('Testfile.csv', 'r') as csvfile:
    emails = csv.reader(csvfile)
    for row in emails:
         list.append(row)

#create doc_set
doc_set=[]

for row in list:
    doc_set.append(row[0])

#import plugins - need to install gensim and stop_words manually for fresh python install
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()


# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:

    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)


# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word = dictionary, passes=10)

print(ldamodel.print_topics(num_topics=5, num_words=5))

# map topics to documents
doc_lda=ldamodel[corpus]

with open('doc_lda.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for row in doc_lda:
        writer.writerow(row)

Essentially, I identify a number of topics (5 in the code above – to be checked), and using the last line I assign each news article a score, which indicates the probability of an article being related to one of these topics. Now, I can only manually make a qualitative assessment of whether a given topic is related to a crisis, which is bit unfortunate. What I would much rather do, is to tell the algorithm whether an article was published during a crisis and use this additional piece of information to identify both topics for my “crisis years” as well as for my “non-crisis-years”. Simply splitting my dataset to just consider topics for my “bads” (i.e. crisis years only) won’t work in my opinion, as I would still need to manually select which topics would actually be related to a crisis, and which topics would show up anyways (sports news, …).

So, is there a way to adapt the code to a) incorporate the information of “crisis” vs “non-crisis” and b) to automatically chose the optimal number of topics / words to optimize the predictive power of the model?

Thanks a lot in advance!

This isn't entirely appropriate for SO, as it's not so much a programming question as a data analysis question, though I'm not sure where it'd fit better... — drevicko, Aug 10 '16 at 10:14

score 0 · Answer 1 · answered Aug 10 '16 at 10:13

First some suggestions on your specific questions:

a) incorporate the information of “crisis” vs “non-crisis”

To do this with a standard LDA model, I'd probably go for mutual information between doc topic proportions and whether docs are in a crisis/non crisis period.

b) to automatically chose the optimal number of topics / words to optimize the predictive power of the model?

If you want to do this properly, experiment with many settings for the number of topics and try to use the topic models to predict conflict/nonconflict for held out documents (documents not included in the topic model).

There are many topic model variants that effectively choose the number of topics ("non-parametric" models). It turns out that the Mallet implementation with hyperparameter optimisation effectively does the same, so I'd suggest using that (provide a large number of topics - hyperparameter optimisaiton will result in many topics with very few assigned words, these topics are just noise).

And some general comments:

There are many topic model variants out there, and in particular a few that incorporate time. These may be a good choice for you (as they'll better resolve topic changes over time than standard LDA - though standard LDA is a good starting point).

One model I particularly like uses pitman-yor word priors (better matching zipf distributed words than a dirichlet), accounts for burstiness in topics and provides clues on junk topics: https://github.com/wbuntine/topic-models

Stipulation of "Good"/"Bad"-Cases in an LDA Model (Using gensim in Python)

1 Answers1