25

I am building a "Book search" API using Lucene. I need to index the book name, author, and book category fields in a Lucene index.

A single book can fall under multiple distinct book categories, for example:

BookName1 -- fiction, humour, philosophy. BookName2 -- fiction, science. BookName3 -- humour, business. BookName4 -- humour, and so on.

The user should be able to search for all the books under a particular category, say "humour".

Given this situation, how do I index the above fields and build the query in Lucene?

user40907

3 Answers

33

A field can occur multiple times in a single Lucene document. Create the document, add the values for the name and author, then add one category field for each category:

  • create new Lucene document
  • add name field and value
  • add author field and value
  • for each category:
    • add category field and value
  • add document to index

When you search the index for a category, it will return all documents that have a category field with the value you're after. The category should be a 'Keyword' field, i.e. indexed as a single untokenized term.

I've written it in English because the specific code is slightly different per Lucene version.
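To make the steps above concrete without tying them to one Lucene version, here is a minimal sketch in plain Java. It is a toy model, not the Lucene API: the class `MultiValuedIndexSketch`, its nested `Doc` type, and the method `namesInCategory` are all hypothetical names made up for illustration. It only shows the idea that one document can carry several values under the same field name, and that a lookup on any one of those values finds that single document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only (NOT the Lucene API): shows one document holding
// several values for the same field name.
class MultiValuedIndexSketch {

    // A document maps each field name to every value added under that name.
    static final class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        Doc add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
            return this;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    // Inverted index: "field:value" -> ids of documents containing that exact value.
    private final Map<String, Set<Integer>> postings = new LinkedHashMap<>();

    // Stores the document once and registers one posting per field value.
    int addDocument(Doc doc) {
        int id = docs.size();
        docs.add(doc);
        doc.fields.forEach((name, values) -> {
            for (String value : values) {
                postings.computeIfAbsent(name + ":" + value, k -> new LinkedHashSet<>()).add(id);
            }
        });
        return id;
    }

    int size() {
        return docs.size();
    }

    // Exact-match lookup on the category field, like a keyword field.
    List<String> namesInCategory(String category) {
        List<String> names = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault("category:" + category, Collections.<Integer>emptySet());
        for (int id : ids) {
            names.add(docs.get(id).fields.get("name").get(0));
        }
        return names;
    }

    public static void main(String[] args) {
        MultiValuedIndexSketch index = new MultiValuedIndexSketch();
        // Hypothetical sample data.
        index.addDocument(new Doc().add("name", "BookName1").add("author", "Author1")
                .add("category", "fiction").add("category", "humour").add("category", "philosophy"));
        index.addDocument(new Doc().add("name", "BookName4").add("author", "Author4")
                .add("category", "humour"));

        System.out.println("documents: " + index.size());
        System.out.println("humour: " + index.namesInCategory("humour"));
    }
}
```

In real Lucene code the same shape applies: one document object, with the category field added once per category. The exact field class to use for an untokenized value depends on the Lucene version.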

Doug
  • Won't this create multiple documents? What happens if you have 3 million records and each book has 3-5 categories? You would have between 9 and 15 million records. I wonder if there is some other way of achieving the same. – Rafael Herscovici Oct 15 '11 at 15:15
  • No, you would only have one document. It isn't like a database where you manage the schema AND the index. You have to relax and let Lucene handle the index; it's really clever stuff. – Doug Nov 22 '11 at 21:02
  • This won't work in Zend_Search_Lucene. The source for Zend_Search_Lucene_Document::addField( $field ) is { $this->_fields[$field->name] = $field; return $this; }, so a second field with the same name overwrites the first. – Steve Feb 21 '12 at 09:13
  • How do I test whether all the categories are saved? When I write a query, I only get the first category returned for a doc. – trillions Jun 08 '12 at 02:45
  • Would the ranking function treat matches in several fields the same as matches for several tokens in the same field? – benroth Mar 30 '14 at 16:48
5

You can create a simple "category" field, where you list all categories for a book separated by spaces.

Then you can search something like:

stock market AND category:(+"business")

Or, if you want to require more than one category (a book must be in all of them):

stock market AND category:(+"business" +"philosophy")
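If the category list comes from user input, the query string above can be assembled programmatically. The sketch below is one hedged way to do it in plain Java; `CategoryQuerySketch`, `categoryClause`, and `build` are hypothetical helper names, not part of Lucene. The escaping of backslashes and double quotes inside a category name is an added precaution that the original queries do not need.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds query strings in the syntax shown above.
class CategoryQuerySketch {

    // Produces category:(+"a" +"b"), requiring every listed category.
    static String categoryClause(List<String> categories) {
        return categories.stream()
                // Escape backslashes and double quotes so each value stays one quoted phrase.
                .map(c -> "+\"" + c.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" ", "category:(", ")"));
    }

    // Combines free-text search terms with the required categories.
    static String build(String text, List<String> categories) {
        return text + " AND " + categoryClause(categories);
    }

    public static void main(String[] args) {
        System.out.println(build("stock market", Arrays.asList("business")));
        System.out.println(build("stock market", Arrays.asList("business", "philosophy")));
    }
}
```

The resulting string would then be handed to whichever query parser your Lucene version provides.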
zehrer
4

I would use Solr instead - it's built on Lucene and managed by the ASF, but is much, much easier to use than Lucene, especially for newcomers.

It offers pretty much all the mainline features of Lucene (certainly everything you'll need for the project you describe), plus extra things like snapshotting, replication, schemas, ...

In Solr, you would simply define the fields you want to index something like this in schema.xml:

<field name="book_id" type="string" indexed="true" stored="true" required="true" multiValued='false'/>
<field name="book_name" type="text" indexed="true" stored="true" required="true" multiValued='false' />
<field name="book_authors" type="text" indexed="true" stored="true" required="true" multiValued='true' />
<field name="book_categories" type="textTight" indexed="true" stored="true" required="true" multiValued='true' />

Note that the multiValued='true' attribute lets you effectively pass an array or list of values to this field, and Solr indexes each value.

Once you have this, start up Solr and you can ask queries like "book_authors:Hemingway" or "book_categories:Romance book_categories:Mills".
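Because Solr is queried over HTTP, a query like the ones above ends up as the `q` parameter of a request to the `select` handler. The sketch below only builds such a URL in plain Java. The class `SolrSelectUrlSketch`, the method `selectUrl`, and the core URL `http://localhost:8983/solr/books` (in particular the core name `books`) are assumptions for illustration; in practice a client library would normally do this for you.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a Solr select URL for a raw query string.
class SolrSelectUrlSketch {

    // coreUrl is assumed to look like http://host:port/solr/<core>.
    static String selectUrl(String coreUrl, String query) {
        try {
            // URL-encode the query so ':' and spaces survive as a single parameter value.
            return coreUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/books"; // assumed local core
        System.out.println(selectUrl(core, "book_authors:Hemingway"));
        System.out.println(selectUrl(core, "book_categories:Romance book_categories:Mills"));
    }
}
```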

There are several query handlers pre-written and configured for you to do things like parse complex queries (fuzzy matches, boolean operations, scoring boosts, ...), and as Solr's API is exposed over HTTP, all this is wrapped by a number of client libraries, so you don't need to handle the low-level details of crafting queries yourself.

There is lots of great documentation on their website to get you started.

James Brady