
I am working in a domain of data science that is very new to me, and I would like to know whether anyone can suggest existing academic literature with approaches relevant to my problem.

The problem setting is as follows: I have a set of named topics (about 100 topics). We have a document tagging engine that tags documents (news articles in our case) based on their text with up to 5 of these 100 topics.

All this is done using fairly rudimentary similarity metrics: each topic is represented as a text vector, as is each document, and we compute the similarity between these vectors and assign the 5 most similar topics to each document.
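The similarity-based tagging described above can be sketched as follows. This is not the poster's actual system, just a minimal illustration assuming dense cosine-similarity vectors and NumPy; the function name and toy data are hypothetical:

```python
import numpy as np

def tag_documents(doc_vecs, topic_vecs, k=5):
    """Assign each document the k most similar topics by cosine similarity."""
    # Normalize rows so the dot product equals cosine similarity
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    t = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    sims = d @ t.T  # (n_docs, n_topics) similarity matrix
    # Indices of the top-k topics per document, most similar first
    return np.argsort(-sims, axis=1)[:, :k]

# Toy example: 3 documents and 6 topics in a 4-dimensional vector space
rng = np.random.default_rng(0)
docs = rng.random((3, 4))
topics = rng.random((6, 4))
tags = tag_documents(docs, topics, k=5)
```

In practice the vectors would come from TF-IDF or embeddings over the topic descriptions and document text, but the top-k selection step is the same.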

We are looking to improve the quality of this process, but the constraint is that we must maintain the set of 100 named topics, which are vital for other purposes. Unsupervised topic models like LDA are therefore out because: 1. They don't produce named topics. 2. Even if we could somehow map the topic distributions output by LDA to the existing topics, these distributions would not remain constant and would vary with the underlying corpus.

So could anyone point me towards papers that have worked with document tagging using a finite set of named topics?

There are 2 challenges here: 1. Given a finite set of named topics, how do we tag new documents with them? (This is the bigger, more obvious challenge.) 2. How do we keep the topics updated as the document universe changes? Any work that addresses one or both of these challenges would be a great help.

P.S. I've also asked this question on Quora, in case anyone else is looking for answers and would like to read both posts. I'm duplicating the question because I find it interesting and want to get as many people discussing this problem, and as many literature suggestions, as possible.

Same Question on Quora

Has QUIT--Anony-Mousse
Nikhil

1 Answer

Have you tried classification?

Train a classifier for each topic.

Tag with the 5 most likely classes.
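The answer's suggestion (one classifier per topic, tag with the 5 most likely classes) could be sketched with scikit-learn's one-vs-rest wrapper. This is an illustrative sketch, not the answerer's implementation; the toy documents and labels are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data: each document carries up to 5 topic labels
docs = ["election results announced", "new government budget passed",
        "football championship final", "stock market rallies today"]
labels = [["elections", "politics"], ["government", "politics"],
          ["sports"], ["markets"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)          # multi-label indicator matrix

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One binary classifier per topic (one-vs-rest)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

def top_k_topics(text, k=5):
    """Tag a new document with the k most probable topics."""
    probs = clf.predict_proba(vec.transform([text]))[0]
    top = np.argsort(-probs)[:k]
    return [mlb.classes_[i] for i in top]
```

With 100 topics this is the same pattern, just with a single multi-label training set rather than 100 separately curated ones, which is the point made in the comments below.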

Has QUIT--Anony-Mousse
  • Yes, that is what we are currently trying, and the results are decent. But having something like 100 one-vs-rest classifiers requires manually generating ground truth for 100 topic datasets, and it is very expensive to maintain, meaning it requires constant updates every so often. So I was trying to see whether anyone had automated this process in some other way? @Anony-Mousse – Nikhil Sep 13 '15 at 18:24
  • You only need one training set, and you can derive the 100 sets from it easily. Any learning approach *will* need such training data - it needs to learn word distributions. There are classifiers for data streams that can continuously learn; but you will need to continuously steer this, both to avoid performance degradation *and* to adapt more quickly to new topics and changes. Say the presidential elections are over and there *is* a new head of government: articles containing Clinton are now the topic "government" and no longer "elections", and Obama is no longer "government" but just regular "politics". – Has QUIT--Anony-Mousse Sep 13 '15 at 20:26
  • Do you have examples of such streaming document classification systems? A good publication on this, if you know of any, would be a great starting point for me to expand my search. @Anony-Mousse – Nikhil Sep 14 '15 at 00:07
  • Not of systems, but I have seen a number of algorithms published. Look for e.g. Hoeffding Trees. There should be plenty of results when you search for "streaming classification" or "online classification". But *do* consider simply retraining the classifier every day - this gives you a wider choice of methods and much faster execution. – Has QUIT--Anony-Mousse Sep 14 '15 at 04:52