How do I combine usage of db4o to store data and Lucene to index data for fast search?

Question

I'm new to both db4o and Lucene.

Currently I'm using db4o to persist my data on an Android app. I need the capability to perform quick searches, as well as provide suggestions to the user (e.g., auto complete suggestions).

An SO poster mentioned using Lucene to index data and db4o to store it.

Has anyone implemented this approach? If so, I would appreciate it if you could share the overall design. What are the alternatives?

I would use Lucene alone as the data store. There is no need for db4o, or why would you use it? (Just store the document as JSON in a stored, non-indexed, probably compressed field.) – Karussell, Oct 19 '11 at 20:49

Sam Stainsby · Accepted Answer · 2011-05-02T22:34:06.570

3

I used Lucene to extract keywords from the items to be stored in the database, and stored what I call 'keyword extension' objects that point to the corresponding domain objects. This made the domain objects findable by keyword (also allowing for stemming) and kept the keyword concerns separate from the domain objects. The database was built from a large static dataset (the USDA food nutrient database), so I didn't need to worry about changes at runtime. The solution is therefore limited in its current form.

The first part of the solution was to write a small chunk of code that takes some text and extracts both the keywords and corresponding stems (using Lucene's 'Snowball' stemming) into a map. You use this to extract the keywords/stems from some domain objects that you are storing in the database. I kept the original keywords around so that I could create some sort of statistics on the searches made.
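As a rough illustration of that first step, here is a minimal sketch that maps each stem to the original keywords that produced it. It does not use Lucene: `toyStem` is a hypothetical stand-in that strips a few English suffixes, and in a real setup Lucene's Snowball stemming would take its place. The class and method names are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class KeywordExtractorSketch {

    // Hypothetical stand-in for a real stemmer. It only strips a few common
    // English suffixes and is NOT the Snowball algorithm.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "es", "s"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Maps each stem to the set of original keywords that reduced to it,
    // so the original words stay available (e.g. for search statistics).
    static Map<String, Set<String>> extract(String text) {
        Map<String, Set<String>> stemsToKeywords = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() < 3) {
                continue; // skip empty and very short tokens
            }
            stemsToKeywords
                    .computeIfAbsent(toyStem(token), k -> new LinkedHashSet<>())
                    .add(token);
        }
        return stemsToKeywords;
    }

    public static void main(String[] args) {
        System.out.println(extract("Vitamins and vitamin C, total ascorbic acid"));
    }
}
```

A real analyzer would also drop stop words such as "and", which this sketch keeps.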

The second part was to construct objects I called 'keyword extensions' that store the stems as one array and the corresponding keywords as another, and that hold a pointer to the domain object the keywords came from (I used arrays because they work more easily with DB4O). I also subclassed my KeywordExtension class to match the particular domain object's type. For example, I was storing a 'Nutrient' domain object and a corresponding 'NutrientKeywordExtension' object.
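The shape of those objects might look roughly like the sketch below. Only the names `KeywordExtension`, `NutrientKeywordExtension` and `Nutrient` come from the description above; the fields, constructors and the `hasStem` helper are assumptions for illustration, and the code is plain Java with no db4o calls.

```java
import java.util.Arrays;

// Minimal domain object; the real Nutrient class would carry more data.
class Nutrient {
    final String name;

    Nutrient(String name) {
        this.name = name;
    }
}

// Holds stems and the original keywords as parallel arrays, plus a pointer
// to the domain object they were extracted from.
class KeywordExtension {
    final String[] stems;
    final String[] keywords;
    final Object target;

    KeywordExtension(String[] stems, String[] keywords, Object target) {
        this.stems = stems.clone();
        this.keywords = keywords.clone();
        this.target = target;
    }

    // True if this extension contains the given stem.
    boolean hasStem(String stem) {
        return Arrays.asList(stems).contains(stem);
    }
}

// Subclass tied to one domain type, so searches can target nutrients only.
class NutrientKeywordExtension extends KeywordExtension {
    NutrientKeywordExtension(String[] stems, String[] keywords, Nutrient nutrient) {
        super(stems, keywords, nutrient);
    }

    Nutrient nutrient() {
        return (Nutrient) target;
    }
}
```

Keeping the extension in its own class leaves the domain object free of search-related fields, which is the separation the answer describes.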

The third part is to collect the user's entered search text, again use the stemmer to extract the stems, and search for the NutrientKeywordExtension objects with those stems. You can then grab the Nutrient objects that those extensions point to, and finally present them as search results.
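A minimal sketch of that lookup follows. It uses an in-memory list where the real application would query the database, a condensed `Extension` class as a stand-in for the stored keyword extensions, and a hypothetical one-rule `toyStem` in place of Lucene's stemmer. Requiring every query stem to match is also an assumption; matching any stem would be an equally valid reading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class KeywordSearchSketch {

    // Condensed stand-in for a stored keyword extension: the stems plus a
    // pointer (here just a name) to the domain object.
    static final class Extension {
        final String nutrientName;
        final String[] stems;

        Extension(String nutrientName, String... stems) {
            this.nutrientName = nutrientName;
            this.stems = stems;
        }
    }

    // Hypothetical toy stemmer: strips a trailing "s" only. Not Snowball.
    static String toyStem(String word) {
        return word.endsWith("s") && word.length() > 3
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Returns the names of nutrients whose extension contains every stem
    // derived from the user's search text.
    static List<String> search(List<Extension> stored, String query) {
        Set<String> wanted = new LinkedHashSet<>();
        for (String token : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                wanted.add(toyStem(token));
            }
        }
        List<String> hits = new ArrayList<>();
        if (wanted.isEmpty()) {
            return hits; // nothing searchable was entered
        }
        for (Extension ext : stored) {
            if (Arrays.asList(ext.stems).containsAll(wanted)) {
                hits.add(ext.nutrientName);
            }
        }
        return hits;
    }
}
```

The linear scan is only for illustration; with a large dataset the stem lookup would need to be backed by an indexed query.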

As I said, my database was static: it's created the first time the application runs. In a dynamic database, you would need to worry about keeping the nutrients and the corresponding keyword extensions in sync. One solution would be to merge the nutrient and the nutrient keyword extension into one class, if you don't mind having that stuff inside your domain objects (I don't like this). Otherwise, you need to account for the keyword extensions every time you create, edit, or delete your domain objects.
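One way to handle the dynamic case is to route every create, edit and delete through a single wrapper that updates the keyword data in the same step. The sketch below is not part of the original solution: it uses in-memory maps where the real code would talk to the database, skips stemming entirely, and all names in it are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

class SyncedNutrientStoreSketch {
    // nutrient id -> description (stand-in for the stored domain objects)
    private final Map<String, String> descriptions = new HashMap<>();
    // nutrient id -> keywords (stand-in for the stored keyword extensions)
    private final Map<String, String[]> keywordsById = new HashMap<>();

    // Create or edit: the keyword data is rebuilt together with the object.
    void save(String id, String description) {
        descriptions.put(id, description);
        keywordsById.put(id, keywordsOf(description));
    }

    // Delete: the keyword data is removed together with the object.
    void delete(String id) {
        descriptions.remove(id);
        keywordsById.remove(id);
    }

    // Returns the keywords for the given id, or null if nothing is stored.
    String[] keywordsFor(String id) {
        return keywordsById.get(id);
    }

    // Lower-cased, de-duplicated, sorted tokens. A real version would also
    // apply the stemmer here.
    static String[] keywordsOf(String text) {
        TreeSet<String> keywords = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                keywords.add(token);
            }
        }
        return keywords.toArray(new String[0]);
    }
}
```

Because callers never touch the two stores directly, the keyword data cannot drift out of date unless something bypasses the wrapper.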

I hope this limited example helps.

edited May 02 '11 at 22:34

answered Apr 30 '11 at 22:03

Sam Stainsby


@Sam - thanks for responding. Can you give me an idea of the size of the index and how much time it took to build the initial index on the phone? – Soumya Simanta May 03 '11 at 14:04
@Soumya the indexes in this case are embodied by the set of KeywordExtension objects. There is a lot more data in the database, and I haven't worked out how much space these particular objects take up. I suspect the majority of the space is taken up by the 555,726 nutrient entry objects in any case, leading to a 45 MB database file. This is all on a Granite web application (Granite is our own open source Scala/Wicket/DB4O stack), not on a phone. It takes just over a minute on a 6-core desktop to generate the entire DB4O database from scratch. – Sam Stainsby May 04 '11 at 01:21
@Sam - that's helpful information. Is 45 MB the DB4O db file size or the size of the Lucene index? – Soumya Simanta May 04 '11 at 01:56
@Soumya 45 MB is the total DB4O db file size – Sam Stainsby May 04 '11 at 05:28
@Sam - thanks. Can you please tell me the size of the Lucene index? – Soumya Simanta May 04 '11 at 15:03
@Soumya As I said before, "the indexes in this case are embodied by the set of KeywordExtension objects" ... "and I haven't worked out what space these particular objects take up." It is a fraction of the database size, but without further work, I don't know what fraction. All I can say is that there is one extension object per domain object. – Sam Stainsby May 04 '11 at 23:43
@Sam Is this an open source project? Are you going to release some code? The db4o community could benefit from what you did (it's awesome) – German May 20 '11 at 15:13
Not currently open source - it was really a test case for our Granite framework (which is open source). Will think about what to do with it. – Sam Stainsby May 21 '11 at 11:22
@German - you can now see a basic web UI for the food nutrient database here: http://nutrients.ofthings.net/ – Sam Stainsby Aug 12 '11 at 03:23
