I have a corpus of documents that I have already tagged. I have a fixed list of about 400 tags relating to different topics. Each document has been given one or more tags and a short title. (I also have a much larger list of titles, which I often reuse when a document contains very similar content.)

I want to build an interface that suggests tags and titles (from my existing lists) for new documents that I add to the corpus, based on how I have tagged the existing documents.

I have read about probabilistic topic models such as LDA, which look great for analyzing text when you don't have any existing tagged data. But I don't see any way to incorporate my existing work.

Any suggestions would be appreciated.

Kind Regards

Swami

swami

1 Answer

For tag suggestion, our experience is that a search engine is enough; there is no need for topic modeling.

Try the steps below:

  • Set up an index on the title and abstract of all your documents.
  • Use the title or abstract of the new document as a query against the index to get a ranked list of similar documents.
  • Take the first few most similar documents from that list and pool all of their tags into one tag bundle.
  • Sort the tag bundle by how often each tag occurs; the most frequent tags are the suggestions.
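
The steps above can be sketched in a few lines of Python. This is only an illustration, not a drop-in replacement for a real search engine: it assumes a small in-memory TF-IDF index with cosine similarity standing in for the search engine's ranking, and the names `build_index`, `suggest_tags`, `top_docs` and `top_tags` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    # Build a length-normalised TF-IDF vector, ignoring terms unknown to the index.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(tagged_docs):
    # tagged_docs: list of (text, tags) pairs for the already-tagged corpus.
    tokenized = [tokenize(text) for text, _ in tagged_docs]
    df = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(tagged_docs)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    tags = [list(doc_tags) for _, doc_tags in tagged_docs]
    return idf, vectors, tags


def suggest_tags(text, index, top_docs=5, top_tags=5):
    idf, vectors, tags = index
    query = tfidf_vector(tokenize(text), idf)
    # Score every indexed document by cosine similarity to the query.
    scored = []
    for doc_id, vec in enumerate(vectors):
        sim = sum(w * vec.get(t, 0.0) for t, w in query.items())
        if sim > 0:
            scored.append((sim, doc_id))
    scored.sort(reverse=True)
    # Pool the tags of the most similar documents and rank them by frequency.
    votes = Counter()
    for _, doc_id in scored[:top_docs]:
        votes.update(tags[doc_id])
    return [tag for tag, _ in votes.most_common(top_tags)]


# Made-up example corpus, for illustration only.
corpus = [
    ("solr index search query ranking", ["search", "solr"]),
    ("lucene search index scoring", ["search", "lucene"]),
    ("bread flour yeast oven baking", ["baking"]),
]
index = build_index(corpus)
print(suggest_tags("how to tune search index ranking", index, top_docs=2, top_tags=1))
```

With a real search engine, the index and the similarity ranking come from the engine itself, so only the tag-pooling step at the end of `suggest_tags` would need to be written by hand.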

This approach has been workable for us.

Mountain
  • Our "documents" are in fact user selections from real documents. They have no abstract. The word count varies from 500 to 5000 words, and the topic content may be completely unrelated even when the source doc is the same. Is it possible to use the document text itself as the query? We're using Solr. I imagine a 5000-word query would take too much processing power unless we did some really aggressive stop word removal. I'd like to know your thoughts on this, Mountain. – swami Jul 31 '13 at 05:12
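
One possible way to keep a full-text query manageable, sketched here as an assumption rather than something proposed in the thread, is to reduce the document to its few most distinctive terms before searching. The function below is hypothetical: it assumes the per-term document frequencies (`doc_freq`) and the corpus size (`num_docs`) can be obtained from the index, and it scores each term by TF-IDF so that very common words drop out without a hand-made stop word list.

```python
import math
import re
from collections import Counter


def top_query_terms(text, doc_freq, num_docs, max_terms=25):
    # Count how often each term occurs in the new document.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {}
    for term, tf in counts.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # terms absent from the index cannot match anything
        score = tf * math.log(num_docs / df)
        if score > 0:  # terms present in every document score 0 and are dropped
            scores[term] = score
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


# Made-up document frequencies, for illustration only.
doc_freq = {"the": 100, "solr": 5, "index": 20, "query": 10}
print(top_query_terms("the solr index and the solr query", doc_freq, 100, max_terms=2))
```

The resulting short list of terms can then be sent as an ordinary query, which keeps the cost roughly independent of the length of the original selection. Solr's MoreLikeThis feature is built around a similar idea and may be worth checking before writing this by hand.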