What's the best way to compare several corpora in natural language?

Question

I've been building LDA topic models of natural-language narrative reports for a research project (using Gensim with Python). I have several smallish corpora (from 200 to 1,400 docs each; I know, that's tiny!) that I'd like to compare, but I don't know how to do that beyond looking at each LDA model individually (for instance with pyLDAvis). My academic background is not in CS, and I'm still a bit new to NLP.

What are some good ways to compare topics across corpora/topic models? For instance, is it possible to estimate how much two LDA models overlap? Or are there other ways to assess the topic similarity of several corpora?

Thanks in advance for your help!

Sir Cornflakes · Answer 1 · 2017-09-06T16:52:59.897

2

Join the corpora into one big corpus, train a topic model with parameters that seem good to you, and then compare how the topics are distributed among the subcorpora.

This is the only clean method I know of. Note that different random seeds produce different topic models even with all other parameters fixed; there is no such thing as *the* topic model of a corpus.
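As a rough illustration of the "compare how the topics are distributed among the subcorpora" step, here is a minimal sketch. It assumes you have already trained one model on the joined corpus and obtained a topic distribution per document as a list of `(topic_id, probability)` pairs (the format Gensim's `LdaModel.get_document_topics` returns), and that you keep a parallel list of subcorpus labels yourself. The function name `mean_topic_shares`, the labels, and the toy numbers are made up for the example.

```python
from collections import defaultdict


def mean_topic_shares(doc_topics, doc_labels, num_topics):
    """Average the per-document topic distributions within each subcorpus.

    doc_topics: one list of (topic_id, probability) pairs per document.
    doc_labels: the subcorpus label of each document, in the same order.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for topics, label in zip(doc_topics, doc_labels):
        for topic_id, prob in topics:
            sums[label][topic_id] += prob
        counts[label] += 1
    # Divide the summed probabilities by the number of documents per subcorpus.
    return {
        label: [total / counts[label] for total in sums[label]]
        for label in sums
    }


# Toy stand-in for model output: three documents, two topics (hypothetical values).
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.7), (1, 0.3)],
    [(0, 0.2), (1, 0.8)],
]
doc_labels = ["corpus_a", "corpus_a", "corpus_b"]

shares = mean_topic_shares(doc_topics, doc_labels, num_topics=2)
for label, dist in sorted(shares.items()):
    print(label, [round(p, 2) for p in dist])
```

Each subcorpus ends up with one averaged topic distribution, which you can then plot or compare side by side. Averaging treats every document equally; weighting by document length would be another reasonable choice.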

An example (where the subcorpora are the different years of publication of scientific papers) can be found in this abstract (Full citation:

@InProceedings{fankhauser-etal2016,
Title                    = {Topical Diversification over Time in the {R}oyal {S}ociety {C}orpus },
Author                   = {Peter Fankhauser and J{\"o}rg Knappen and Elke Teich},
Booktitle                = {Proceedings of DH  2016},
Year                     = {2016},
Address                  = {Krakow, Poland},
Month                    = {July 12-16},
url                      = {http://dh2016.adho.org/abstracts/322},
}

).

edited Sep 06 '17 at 16:52

answered Sep 05 '17 at 12:01

Sir Cornflakes


Thanks! Do you have a tutorial or example of how to compare how the topics are distributed among the subcorpora? Given that the documents belonging to each subcorpus are not labeled/tagged with the name of their subcorpus in the model, I'm not sure how to do this. Any help would be greatly appreciated! – Paul Miller Sep 06 '17 at 14:40
@PaulMiller: I have added one example study that I co-authored to my answer. Of course you will have to do some bookkeeping (e.g., by maintaining lists of documents belonging to the respective subcorpora). For the statistics, we use R and Python, but you can choose any tool you prefer. – Sir Cornflakes Sep 06 '17 at 15:02
1

I think there's no correct mathematical way (yet) to compare topics trained on different corpora. However, if you grouped all your documents together into one corpus and trained on that, you could very easily find the similarity between documents (by their topic distributions) in that corpus using the [Jensen-Shannon distance](https://stackoverflow.com/questions/15880133/jensen-shannon-divergence) – PyRsquared Sep 20 '17 at 20:33
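For reference, a minimal pure-Python sketch of the Jensen-Shannon distance between two topic distributions follows. It assumes both inputs are dense probability vectors of the same length that each sum to 1; the function names and the example vectors are made up for illustration. It uses base-2 logarithms, so the result lies between 0 (identical) and 1 (no overlap). SciPy also ships `scipy.spatial.distance.jensenshannon`, if you would rather not write this by hand.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between distributions p and q."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical topic distributions for two documents (or two subcorpus averages).
doc_a = [0.8, 0.2, 0.0]
doc_b = [0.1, 0.3, 0.6]

print(jensen_shannon_distance(doc_a, doc_a))  # identical distributions
print(jensen_shannon_distance(doc_a, doc_b))  # partially overlapping distributions
```

The same function works on per-document distributions or on averaged distributions per subcorpus, as long as both come from the same model so that the topic indices line up.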
Hi @PaulMiller, have you got any updates on how you did this? – Mtrinidad May 28 '21 at 04:03
