9

I have a database of about 200k books. I wish to give my users a way to quickly search a book by the title. Now, some titles might have prefix like A, THE, etc. and also can have numbers in the title, so search for 12 should match books with "12", "twelve" and "dozen" in the title. This will work via AJAX, so I need to make sure database query is really fast.

I assume that most of the users will try to search using some words of the title, so I'm thinking to split all the titles into words and create a separate database table which would map words to titles. However, I fear this might not give the best results. For example, the book title could be some 2 or 3 commonly used words, and I might get a list of books with longer titles that contain all 2-3 words and the one I'm looking for lost like a needle in a haystack. Also, searching for a book with many words in the title might slow down the query because of a lot of OR clauses.

Basically, I'm looking for a way to:

  • find the results quickly
  • sort them by relevance.

I assume this is not the first time someone needs something like this, and I'd hate to reinvent the wheel.

P.S. I'm currently using MySQL, but I could switch to anything else if needed.

Milan Babuškov
  • 59,775
  • 49
  • 126
  • 179
  • correct me if I am wrong, but you are looking for Information Retrieval/Search Engine system, for the books, aren't you? in this case, have a look at Apache Lucene. – amit May 08 '11 at 15:50
  • @amit, "Have a look at Apache Lucene" is like saying "google it". The source code is huge and I have a very specific problem. Could you point me to what exactly I need to look at in Lucene? – Milan Babuškov May 08 '11 at 15:53

5 Answers5

2

Using a SOUNDEX is the best way i think.

SELECT
  id,
  title
FROM products AS p
WHERE p.title SOUNDS LIKE 'Shaw'

// This will match 'Saw' etc.

For best database performances you can best calculate the SOUNDEX value of your titles and put this in a new column. You can calculate the soundex with SOUNDEX('Hello').

Example usage:

UPDATE `books` SET `soundex_title` = SOUNDEX(title);
Milan Babuškov
  • 59,775
  • 49
  • 126
  • 179
Tom Claus
  • 1,301
  • 1
  • 9
  • 13
1

Keep it simple. Create an index on the title field and use wildcard pattern matching. You can not possibly make it any faster as your bottleneck is not the string matching but the number of strings you want to match against the title.

And just came up with a different idea. You say that some words can be interpreted differently. Like 12, Twelve, dozen. Instead of creating a query with different interpretations, why not store different interpretations of the titles in a separate table with a one to many to the books. You can then GROUP BY book_id to get unique book titles.

Say the book "A dime in a dozen". In books table it will be:

book_id=356
book_title='A dime in a dozen'

In titles table will be stored:

titles_id=123
titles_book_id=356
titles_title='A dime in a dozen'
--
titles_id=124
titles_book_id=356
titles_title='A dime in a 12'
--
titles_id=125
titles_book_id=356
titles_title='A dime in a twelve'

The query for this: SELECT b.book_id, b.book_title FROM books b JOIN titles t on b.book_id=t.titles_book_id WHERE t.titles_title='%twelve%' GROUP BY b.book_id

Now, insertions becomes a much bigger task, but creating the variants can be done outside the database and inserted in one swoop.

Mel
  • 6,077
  • 1
  • 15
  • 12
  • Interesting idea, still using %twelwe% requires to read all the records as index cannot be used and not have even have more than at start, so it will be even slower. – Milan Babuškov May 10 '11 at 16:36
  • Honestly, I think that you're more helped by assigning keywords to books and leave the titles alone. You can then simply provide a title search and keyword search to the end user. Title search does LIKE foo% match 'as you type', which MySQL **should** optimize as an indexed search and LIKE '%foo%' on submit. Keyword only matches titles where the complete keyword matches the a row in the keyword table. People who type 'the' into a keyword search shouldn't be allowed to use the internet, but of course you can politely formulate that on your search page. – Mel May 10 '11 at 16:51
1

One solution that would easily accomodate your volume of data and speed requirment is to use the Redis key-value pair store. The way I see it, you can go ahead with your solution of mapping titles to keywords and storing them under the form:

keyword : set of book titles

Redis already has a built-in set data-type that you can use.

Next, to get the titles of the books that contains the search keywords you can use the sinter command which will peform set intersection for you.

Everything is done in memory; therefore the response time is very fast. Also, if you want to save your index, redis has a number of different persistance/caching mechanisms.

Mihai Oprea
  • 2,051
  • 3
  • 21
  • 39
1

You might want to have a look at Apache Lucene. this is a high performance java based Information Retrieval System.
you would want to create an IndexWriter, and index all your titles, and you can add parameters (have a look at the class) linking to the actual book.
when searching, you would need an IndexReader and an IndexSearcher, and use the search() oporation on them.
have a look at the sample at: src/demo and in: http://lucene.apache.org/java/2_4_0/demo2.html
using Information Retrieval techniques makes the indexing take longer, but every search will not require going through most of the titles, and overall you can expect better performance for searching.
also, choosing good Analyzer enables you to ignore words such "the","a"...

amit
  • 175,853
  • 27
  • 231
  • 333
1

Apache Lucene with Solr is definitely a very good option for your problem

You can directly link Solr/Lucene to directly index your MySQL database. Here is a simple tutorial on how to link your MySQL database with Lucene/Solr: http://www.cabotsolutions.com/2009/05/using-solr-lucene-for-full-text-search-with-mysql-db/

Here are the advantages and pains of using Lucene-Solr instead of MySQL full text search: http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html

rkg
  • 5,559
  • 8
  • 37
  • 50