
Intro

Currently I am using Gensim in combination with pandas and numpy to run document NLP computation. I'd like to build an LDA sequential model to track how our topics change over time, but I am running into errors with the corpus format.

I am trying to figure out how to set time slices for dynamic topic models. I am using LdaSeqModel, which requires integer time slices.

The Data

It's a csv:

data = pd.read_csv('CGA Jan17 - Mar19 Time Slice.csv', encoding="ISO-8859-1")
documents = data[['TextForTopics']]
documents['index'] = documents.index

   Month  Year  Begin Date  TextForTopics                                       time_slice
0  march  2017  3/23/2017   request: the caller is requesting an appointme...  1

This is then converted into an array of tuples called the bow_corpus:

[[(12, 2), (25, 1), (30, 1)], [(33, 1), (136, 1), (159, 1), (161, 1)], [(165, 1), (247, 2)], [(326, 1), (354, 1), (755, 1), (821, 1)]]

Desired Output

It should print one topic allocation for each time slice. If I entered 3 topics and two time slices, I should get the three topics printed twice, showing how the topics evolved over time.

[(0,
  '0.165*"enrol" + 0.108*"medicar" + 0.051*"form"'),
 (1,
  '0.303*"caller" + 0.290*"inform" + 0.031*"abl"'),
 (2,
  '0.208*"date" + 0.140*"effect" + 0.060*"medicaid"')]
[(0,
  '0.165*"enrol" + 0.108*"cats" + 0.051*"form"'),
 (1,
  '0.303*"caller" + 0.290*"puppies" + 0.031*"abl"'),
 (2,
  '0.208*"date" + 0.140*"elephants" + 0.060*"medicaid"')]

What I've tried

This is the call; the bow corpus is the array of tuples shown above:

ldaseq = LdaSeqModel(corpus=bow_corpus, time_slice=[], num_topics=15, chunksize=1)

I've tried every kind of integer input for time_slice and they all produce errors. My premise was that each entry of time_slice represents the number of indices/rows/documents in that slice. For example, my data has 1.8 million rows; if I wanted two time slices, I would order my data by time and enter counts like time_slice = [489234, 1310766]. All inputs produce this error:

The Error

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-e58059a7fb6f> in <module>
----> 1 ldaseq = LdaSeqModel(corpus=bow_corpus, time_slice=[], num_topics=15, chunksize=1)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in __init__(self, corpus, time_slice, id2word, alphas, num_topics, initialize, sstats, lda_model, obs_variance, chain_variance, passes, random_state, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
    186 
    187             # fit DTM
--> 188             self.fit_lda_seq(corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
    189 
    190     def init_ldaseq_ss(self, topic_chain_variance, topic_obs_variance, alpha, init_suffstats):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in fit_lda_seq(self, corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
    275             # seq model and find the evidence lower bound. This is the E - Step
    276             bound, gammas = \
--> 277                 self.lda_seq_infer(corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)
    278             self.gammas = gammas
    279 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in lda_seq_infer(self, corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)
    351             bound, gammas = self.inferDTMseq(
    352                 corpus, topic_suffstats, gammas, lhoods, lda,
--> 353                 ldapost, iter_, bound, lda_inference_max_iter, chunksize
    354             )
    355         elif model == "DIM":

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in inferDTMseq(self, corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter, chunksize)
    401         time = 0  # current time-slice
    402         doc_num = 0  # doc-index in current time-slice
--> 403         lda = self.make_lda_seq_slice(lda, time)  # create lda_seq slice
    404 
    405         time_slice = np.cumsum(np.array(self.time_slice))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in make_lda_seq_slice(self, lda, time)
    459         """
    460         for k in range(self.num_topics):
--> 461             lda.topics[:, k] = self.topic_chains[k].e_log_prob[:, time]
    462 
    463         lda.alpha = np.copy(self.alphas)

IndexError: index 0 is out of bounds for axis 1 with size 0

Solutions

I went back to the documentation and compared the format of the common_corpus used in its example with my bow_corpus; they are the same. I also tried running the example code from the documentation, but it produced the same error. I'm no longer sure the problem is my code, though I hope it is.

I've also tried changing the file format by manually dividing my csv into 9 csvs, one per time slice, and creating an iterated corpus from those, but that didn't work. I've considered converting each row of my csv into a txt file and then building a corpus from those, as David Beil does, but that seems pointlessly tedious since I already have an iterated corpus.

Sara

2 Answers


I'm going to assume you are working in a single dataframe. Let's say you want to use years as your unit of time.

  1. For time_slice to work properly with ldaseqmodel you need to first order your dataframe ascending, i.e. from oldest to newest.
  2. Create a time_slice variable so you can later feed it back into the model
import numpy as np

uniqueyears, time_slices = np.unique(data.Year, return_counts=True)
# takes all unique values in data.Year, plus how often they occur,
# and returns them as arrays

print(np.asarray((uniqueyears, time_slices)).T)
# see what you've made; technically you don't need this

returns (using example data)

[[1992   28]
 [1993   18]
 [1994   25]
 [1995   18]
 [1996   44]
 [1997   38]
 [1998   30]]

This works for years. If you want to go more fine-grained, you can adapt the same concept, as long as you have the ordering of the documents right (that ordering is how gensim connects them to time slices). For example, to take monthly slices you could rewrite the dates as 201703 for March 2017 and 201704 for April 2017, zero-padding the month so the values sort chronologically. Really, any grain will do as long as you can identify which documents belong to the same slice.
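A minimal sketch of the monthly variant, using an invented 'Begin Date' column shaped like the one in the question:

```python
import pandas as pd

# Invented rows standing in for the question's 'Begin Date' column.
data = pd.DataFrame({
    "Begin Date": ["1/5/2017", "1/20/2017", "2/3/2017", "3/23/2017"],
})

# Build a zero-padded year-month key so sorting is chronological.
dates = pd.to_datetime(data["Begin Date"])
data["year_month"] = dates.dt.strftime("%Y-%m")   # e.g. "2017-03"
data = data.sort_values("year_month")             # oldest to newest

# Document counts per month, in chronological order -- this is time_slice.
time_slice = data["year_month"].value_counts().sort_index().tolist()
print(time_slice)  # [2, 1, 1]
```

The same sorted dataframe is then the one you build bow_corpus from, so the slice counts line up with the document order.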

jhl
  • I sorted the list and created one file. I counted the rows in each time slice and it worked: %%time ldaseq = LdaSeqModel(corpus=bow_corpus, time_slice=[41080, 40439, 35850, 35311, 40392, 38877, 37188, 33306, 10924], num_topics=9) – Sara Aug 15 '19 at 19:44

time_slice (list of int, optional) – Number of documents in each time-slice. Each time slice could for example represent a year's published papers, in case the corpus comes from a journal publishing over multiple years. It is assumed that sum(time_slice) == num_documents. (gensim docs)

In your code the time_slice argument is entered as an empty list:

time_slice=[]

That empty list is what throws the traceback listed in your question.

I'm not familiar enough with your data to tell you exactly what to put in the time_slice argument.

However, here is an example from the docs.

Suppose your corpus has 30 documents, with 5 in the first time-slice, 10 in the second, and 15 in the third.

Your time_slice argument is time_slice=[5,10,15]

Depending on your data you may want to generate the time_slice list directly from your data.
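As a sketch of generating it directly, using an invented sorted list of per-document years that mirrors the 5/10/15 split above:

```python
import numpy as np

# Hypothetical, already-sorted per-document year labels (oldest first).
doc_years = [2017] * 5 + [2018] * 10 + [2019] * 15

# np.unique returns the unique labels in sorted order together with
# their counts, which are exactly the chronological slice sizes.
_, counts = np.unique(doc_years, return_counts=True)
time_slice = counts.tolist()

print(time_slice)                          # [5, 10, 15]
assert sum(time_slice) == len(doc_years)   # invariant LdaSeqModel assumes
```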

Does that clear things up at all?

ZdWhite
  • You're right, but this doesn't help me build an iterable corpus. I don't know what the input for time slices should look like. I have a csv, it has dates, I've taken that csv and created nine csvs divided by dates. Cool now I have 10 files and don't know how to turn them into the format the algorithm wants. All the examples use pre-built data sets that come with gensim. I can't find examples of time_slices built from scratch or especially where the 'documents' are rows in an excel file. – Sara Jul 29 '19 at 21:54
  • Using Gensim you do not need to make multiple CSVs unless your corpus can't fit in RAM. – ZdWhite Jul 30 '19 at 18:10
  • "I can't find examples of time_slices built from scratch": you need to look at your data and decide how many documents fit within each time period. "Where the 'documents' are rows in an excel file": take a look at my question answered by gojomo: https://stackoverflow.com/questions/56681210/convert-a-column-in-a-dask-dataframe-to-a-taggeddocument-for-doc2vec (ignore the dask part). The code he gives shows how to build an iterable corpus. In my case I was using Doc2Vec, but you should be able to adapt it for LDA. – ZdWhite Jul 30 '19 at 18:18