
Is it possible to create a summary of a large document using an out-of-the-box search engine like Lucene, Solr or Sphinx, and then search for the documents most relevant to a query?

I don't need to search inside the document or create a snippet. I just want to get the 5 documents that best match the query.

Update. More specifically, I don't want the engine to keep the whole document, but only its "summary" (you may call it index information or a TF-IDF representation).

Denis Kulagin
  • I'm no expert on those systems, but unless you provide some definition of what the summary should look like, how should those systems know where to look for matches? I'd either provide some summary field that is searched or do a query on the entire document. – Thomas Feb 15 '17 at 16:23
  • In general, yes, you could apply some techniques, but I think your question is very broad; could you be a little more specific? – Mysterion Feb 15 '17 at 16:42
  • Updated the question. – Denis Kulagin Feb 15 '17 at 16:50
  • Well, most such engines use an inverted index; Sphinx certainly does. It creates this special index (which is, roughly, the term frequencies) and by default doesn't store the raw text. You can drastically cut down the size of the index by excluding popular words (think 'the') via stopwords. Otherwise your statements 'don't need to search inside' and 'best matching the query' seem contradictory: how can you find the best match without searching the document text? – barryhunter Feb 15 '17 at 17:43

3 Answers


Basically, if you want a summarization feature, there are plenty of ways to do it, for example TextRank: there is a big article on Wikipedia and plenty of implementations available in NLTK and elsewhere. However, it will not help you with querying; you will still need to index the summaries somewhere.

I think you could achieve something like this using a feature called More Like This. It exists in Lucene, Solr and Elasticsearch. The idea behind it is that if you send a query (the raw text of a document), the search engine will find the most suitable documents by extracting the most relevant words from it (which reminds me of summarization) and then looking inside the inverted index for the top N similar documents. It will not discard the text, though; it performs a "like" operation based on TF-IDF metrics.
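
Here is a sketch of how that could look with the Lucene Java API; the index path, the field name "body", the analyzer choice, and the frequency cut-offs are all assumptions for illustration, not anything from the question:

    import java.io.StringReader;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class MltSketch {
        public static void main(String[] args) throws Exception {
            String documentText = "raw text of the document to match against the index";

            try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                MoreLikeThis mlt = new MoreLikeThis(reader);
                mlt.setAnalyzer(new StandardAnalyzer());
                mlt.setFieldNames(new String[] {"body"}); // hypothetical indexed field
                mlt.setMinTermFreq(1); // keep terms occurring at least once in the input
                mlt.setMinDocFreq(2);  // ...and in at least two indexed documents

                // MLT extracts the most "interesting" terms (by TF-IDF) from the
                // raw text and turns them into an ordinary Lucene query.
                Query query = mlt.like("body", new StringReader(documentText));
                TopDocs top5 = searcher.search(query, 5); // 5 best-matching documents
                for (ScoreDoc sd : top5.scoreDocs) {
                    System.out.println(sd.doc + " score=" + sd.score);
                }
            }
        }
    }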

References for MLT in Elasticsearch, Lucene, Solr

Mysterion

but only its "summary" (you may call it index information or a TF-IDF representation).

What you are looking for seems quite standard:

  • Apache Lucene [1], if you are looking for a library
  • Apache Solr or Elasticsearch, if you are looking for a production-ready enterprise search server

What a Lucene-based search engine does [2] is build an inverted index of each field in your documents (plus a set of additional data structures required by other features), as sketched below.
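
As a toy illustration (plain Java, nothing like Lucene's actual on-disk structures): an inverted index is essentially a map from each term to the IDs of the documents containing it, which is why the raw text is not needed at query time:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    public class ToyInvertedIndex {
        public static void main(String[] args) {
            String[] docs = {"the quick brown fox", "the lazy dog"};

            // term -> IDs of the documents that contain it
            Map<String, Set<Integer>> index = new HashMap<>();
            for (int docId = 0; docId < docs.length; docId++) {
                for (String term : docs[docId].toLowerCase().split("\\s+")) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
            System.out.println(index.get("the")); // [0, 1]
        }
    }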

What you apparently don't want to do is store the content of a field, i.e. take the text content and store it in full (compressed) in the index, to be retrieved later.

In Lucene and Solr this is a matter of configuration.
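
For instance, in a Solr schema a field that is searchable but whose raw text is thrown away is declared with indexed="true" and stored="false"; the field name and type below are illustrative:

    <!-- indexed="true": searchable; stored="false": the raw text is not kept -->
    <field name="body" type="text_general" indexed="true" stored="false"/>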

Summarisation is a completely different NLP task and is probably not what you need.

Cheers

[1] http://lucene.apache.org/index.html

[2] https://sease.io/2015/07/26/exploring-solr-internals-the-lucene-inverted-index/


Update. More specifically, I don't want the engine to keep the whole document, but only its "summary" (you may call it index information or a TF-IDF representation).

To answer your updated question: Lucene/Solr fit your needs. For the "summary", you have the option of not storing the original text by specifying:

 org.apache.lucene.document.Field.Store.NO

By adding the "summary" as an org.apache.lucene.document.TextField, it will be indexed and tokenized. The index will then hold the term statistics from which TF-IDF scores are computed, which is what you search against.
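
A minimal indexing sketch under that configuration; the index path, the field names and the analyzer are illustrative, not prescribed:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class IndexSummaryOnly {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/path/to/index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // A small stored ID so you can tell which document matched.
                doc.add(new StringField("id", "doc-42", Field.Store.YES));
                // Indexed and tokenized, but Store.NO: the raw text is discarded
                // once the term statistics are written to the inverted index.
                doc.add(new TextField("summary", "full text of the document...",
                                      Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }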

XL Zheng