I am building a question-answering system. To speed it up, I want an IR system to return a set of documents from a corpus that are likely to contain the answer to a given question; my NLP algorithm will then try to extract the answer from the full text of those documents.

Since I'm using Python, Whoosh seemed like a good choice, but I'm having a hard time searching with anything other than pure boolean queries, which don't lend themselves to question answering. What I'd like is a ranked list of the documents with the highest TF-IDF similarity to a string query.

I'd like to input:

"Who is the president of the United States?"

and get the most similar documents, but for now I just strip out the stopwords and end up with:

"president OR united OR states"
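Roughly, the query is built like this. This is only a sketch: the `STOPWORDS` set and the `to_or_query` name are illustrative and not part of Whoosh, and a real stopword list would be much longer.

```python
import re

# Illustrative stopword list (an assumption); a real one would be much longer.
STOPWORDS = {"who", "is", "the", "of", "a", "an", "in", "what", "to"}

def to_or_query(question):
    """Lowercase, tokenize, drop stopwords, and join the remaining terms with OR."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    keywords = [t for t in tokens if t not in STOPWORDS]
    return " OR ".join(keywords)

print(to_or_query("Who is the president of the United States?"))
# prints: president OR united OR states
```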

The exact-match nature of a boolean query like that doesn't lend itself to a QA process. Can anyone point me to techniques or more advanced API calls for getting the top documents in a non-boolean, ranked way? I'd be willing to try other libraries, but most seem complex to interface with Python quickly, and I was hoping for something very easy so I could move on and focus on the natural language component.
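To make concrete what I mean by TF-IDF similarity, here is a minimal, library-free sketch of the behaviour I'm after. It does not use Whoosh at all; the function names (`tokenize`, `rank_by_tfidf`), the smoothed IDF formula, and the three sample documents are my own assumptions for illustration, and a real IR library would do this over an index rather than in memory.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_by_tfidf(query, docs, top_n=3):
    """Return (doc_index, score) pairs sorted by cosine similarity, best first."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    # Smoothed IDF (an assumed variant), so terms found in every document keep some weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(tokens):
        # Sparse TF-IDF vector; terms unseen in the corpus are dropped.
        tf = Counter(tokens)
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vectorize(tokenize(query))
    scored = [(i, cosine(q_vec, vectorize(tokens))) for i, tokens in enumerate(doc_tokens)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy corpus, only to show the ranking.
docs = [
    "The president of the United States lives in the White House.",
    "The United Kingdom has a prime minister.",
    "Bananas are rich in potassium.",
]
for idx, score in rank_by_tfidf("Who is the president of the United States?", docs):
    print(idx, round(score, 3))
```

The point is that the whole question goes in as-is, every document gets a graded score instead of a match/no-match verdict, and partially matching documents still show up lower in the list.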

Chet
  • Have you tried PyLucene? It has built-in document similarity that works more or less like that. – Not_a_Golfer Apr 24 '12 at 20:30
  • @Not_a_Golfer, I might give that a shot, but I found it frustrating to set up on Windows. I'm on Linux, but most of my team is not. – Chet Apr 24 '12 at 20:46
