
I need a solution for identifying incorrect chapter headings in a book.

We are developing an ingestion system for books that does all sorts of validation, like spell-checking and offensive-language-filtering. Now we'd like to flag chapter headings that seem inaccurate given the chapter body. For example, if the heading was "The Function of the Spleen", I would not expect the chapter to be about the liver.

I am familiar with fuzzy string matching algorithms, but this seems more like an NLP or classification problem. If I could match (or closely match) the phrase "function of the spleen", then that's great -- high confidence. Otherwise, a high occurrence of both "function" and "spleen" in the text also yields confidence. And of course, the closer together they are, the better.
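To make the phrase-match and proximity signals concrete, here is a rough sketch rather than a finished solution. The class `HeadingProximity` and its method names are made up for illustration, and the tokenizer is a plain lowercase split with no stemming or stop-word removal. It checks for an exact in-order phrase match and, failing that, measures the smallest span of body tokens that contains every heading term (a smaller span suggests a stronger match).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class HeadingProximity {

    // Lowercase the text and split on anything that is not a letter.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // True if the phrase tokens occur consecutively and in order in the body.
    static boolean containsPhrase(List<String> body, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(body, phrase) >= 0;
    }

    // Length (in tokens) of the smallest body span that contains every term,
    // or -1 if at least one term never appears. Classic sliding window.
    static int smallestWindow(List<String> body, Set<String> terms) {
        if (terms.isEmpty()) {
            return -1;
        }
        Map<String, Integer> inWindow = new HashMap<>();
        int covered = 0;
        int best = -1;
        int left = 0;
        for (int right = 0; right < body.size(); right++) {
            String r = body.get(right);
            if (terms.contains(r) && inWindow.merge(r, 1, Integer::sum) == 1) {
                covered++;
            }
            // Shrink from the left while the window still covers all terms.
            while (covered == terms.size()) {
                int len = right - left + 1;
                if (best < 0 || len < best) {
                    best = len;
                }
                String l = body.get(left++);
                if (terms.contains(l) && inWindow.merge(l, -1, Integer::sum) == 0) {
                    covered--;
                }
            }
        }
        return best;
    }
}
```

Calling `containsPhrase(tokens(body), tokens("function of the spleen"))` would give the high-confidence signal, and `smallestWindow(tokens(body), Set.of("function", "spleen"))` the fallback one. How to turn the window length into a confidence score is left open here.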

This needs to be done in-memory, on the fly, and in Java.

My current naive approach is to simply tokenize all the words, remove noise words (like prepositions), stem what's left, and then count the number of matches. At a minimum I'd expect each word in the heading to appear at least once in the text.
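As a minimal sketch of that naive approach (all names here are hypothetical, the stop-word list is a tiny illustrative sample, and `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any library.
final class HeadingCoverage {

    // Illustrative noise-word list; a real system would use a fuller one.
    private static final Set<String> STOP_WORDS = Set.of(
        "a", "an", "the", "of", "in", "on", "for", "to", "and", "or",
        "is", "are", "with", "by", "at");

    // Crude placeholder for a real stemmer: strips a few common suffixes.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "es", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Tokenize, drop noise words, and stem what is left.
    static List<String> terms(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
            .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
            .map(HeadingCoverage::stem)
            .collect(Collectors.toList());
    }

    // For each heading term, count how often it occurs in the body.
    static Map<String, Integer> headingTermCounts(String heading, String body) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : terms(heading)) {
            counts.put(t, 0);
        }
        for (String t : terms(body)) {
            counts.computeIfPresent(t, (k, v) -> v + 1);
        }
        return counts;
    }

    // Minimum bar: every heading term appears at least once in the body.
    static boolean allTermsPresent(Map<String, Integer> counts) {
        return !counts.isEmpty() && counts.values().stream().allMatch(c -> c > 0);
    }
}
```

A heading would be flagged when `allTermsPresent(headingTermCounts(heading, body))` is false. This ignores where the terms occur and in what order.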

Is there a different approach, ideally one that would take into account things like proximity and ordering?

MrLore

1 Answer


I think this is a classification problem; as such, take a look at WEKA.

Yaneeve
  • WEKA is great, thanks! I've also been looking at other similar solutions and the problem is: they all require a training set. But in this case, I don't have one. Just the one chapter text and a title. So how do I create a classifier from just one sample? I can't find any information on this. I am considering: assuming the chapter is consistent and about a focused topic, simply chop it up into mini-documents and train on those? But I can't see this being done anywhere, so perhaps there's a reason why it's inherently futile? – Jesse Harris Dec 01 '13 at 19:48
  • I had assumed that you were going to parse an immense number of books, in which case you would probably have more than one title dealing with similar subject matter. If so, you could choose a portion of those books as a training set (obviously you would need a human to classify them). You could also grow the training set by having a human reassess your classification algorithm's output. I would naively use the KNN algorithm... I encountered a problem like yours in the past; a team I worked in parallel to solved it with a classification algorithm. I don't remember how... – Yaneeve Dec 02 '13 at 12:22
  • Marking as best answer, as Yaneeve provided a perfect solution for the case where I could parse a large number of books (as opposed to one at a time). – Jesse Harris Dec 16 '13 at 19:17