I've created a simple index using Zend_Search_Lucene to search a list of company names, because I want to offer a search that is more intelligent than a simple MySQL LIKE '%query%'. I've used the code below, where 'companyname' is the company name and 'document_id' is a unique ID for each document (I'm aware that Lucene assigns one internally, but I understand that it can change, whereas my document ID will be static).
// Create a new index in the 'test-index' directory
$index = Zend_Search_Lucene::create('test-index');

// Document 1: document_id is stored but not indexed; companyname is tokenized, indexed and stored
$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 1));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'XYZ Holdings'));
$index->addDocument($document);

// Document 2: same name with dots and brackets
$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 2));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'X.Y.Z. (Holdings) Ltd'));
$index->addDocument($document);

// Document 3: same initials separated by spaces
$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 3));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'X Y Z Ltd'));
$index->addDocument($document);

// Write the pending documents to the index
$index->commit();
However, when I run the following code to find all companies with variants of 'XYZ' in their name:
$index = Zend_Search_Lucene::open('test-index');
$hits = $index->find('companyname:XYZ');
foreach ($hits as $hit)
{
    print "ID: " . $hit->document_id . "\n";
    print "Score: " . $hit->score . "\n";
    print "Company: " . $hit->companyname . "\n";
}
I end up with the following:
ID: 1
Score: 1
Company: XYZ Holdings
I was expecting XYZ to match all three documents, since the point of this search is to pick up companies which have the same name but slightly different punctuation, something a simple LIKE clause can't cater for. Is there a reason why Lucene doesn't match all the documents, and is there something I can do to fix this?
I get the same sort of problem if I search for 'companyname:"x.y.z holding"': this doesn't match anything, but 'companyname:"x.y.z holdings"' does. I'd expect Lucene to work out that 'holding' and 'holdings' are sufficiently close to be considered a match.
I'm fairly sure all the documents are indexed because if I search for 'X.Y.Z' I get matches for documents 2 and 3.
Edit: Forgot to mention PHP version (5.3.5-1ubuntu7.4 with Suhosin-Patch) and Zend Framework version (1.11.10-0ubuntu1).