I've created a simple index using Zend_Search_Lucene for searching a list of company names, as I want to be able to offer a search which is more intelligent than a simple MySQL 'LIKE %query%'. I've used the code below, where 'companyname' is the company name and 'document_id' is a unique ID for each document (I'm aware that Lucene assigns one internally, but I understand that can change, whereas my document ID will be static).

$index = Zend_Search_Lucene::create('test-index');

$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 1));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'XYZ Holdings'));
$index->addDocument($document);

$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 2));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'X.Y.Z. (Holdings) Ltd'));
$index->addDocument($document);

$document = new Zend_Search_Lucene_Document();
$document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', 3));
$document->addField(Zend_Search_Lucene_Field::Text('companyname', 'X Y Z Ltd'));
$index->addDocument($document);

$index->commit();

However, when I run the following code to find all companies with variants of 'XYZ' in their name:

$index = Zend_Search_Lucene::open('test-index');
$hits = $index->find('companyname:XYZ');
foreach ($hits as $hit)
{
  print "ID: " . $hit->document_id . "\n";
  print "Score: " . $hit->score . "\n";
  print "Company: " . $hit->companyname . "\n";
}

I end up with the following:

ID: 1
Score: 1
Company: XYZ Holdings

I was expecting XYZ to match all the documents, as the point of having this search is to pick up companies which have the same name but slightly different punctuation, which can't be catered for in a simple LIKE clause. Is there a reason why Lucene doesn't match all the documents, and is there something I can do to fix this?

I get the same sort of problem if I search for 'companyname:"x.y.z holding"' - this doesn't match anything but 'companyname:"x.y.z holdings"' does. I'd expect Lucene to work out that 'holding' and 'holdings' are sufficiently close to be considered a match.

I'm fairly sure all the documents are indexed because if I search for 'X.Y.Z' I get matches for documents 2 and 3.

Edit: Forgot to mention PHP version (5.3.5-1ubuntu7.4 with Suhosin-Patch) and Zend Framework version (1.11.10-0ubuntu1).

pwaring

2 Answers

You can fix the issue by preprocessing your content before indexing it. Lucene works on tokens, and each token is matched as an individual unit. I did something similar in the past to match version numbers, so that searching for 2.0 would also return 2.0.3, for example, but not 1.2.0.

The toCanonical() function here is not perfect. I recommend you write your own and build a test suite to make sure it converts the text as you expect. What it does is build a longer string by grouping the things that look like acronyms. You can also call it on the search query.

You will need to search in companyname_canonical instead of companyname.

There may be a cleaner way to do it as a filter within Zend Lucene. You might also want to use a stemmer to handle plural forms and the like. There is an implementation of the Porter stemmer already written: http://codefury.net/2008/06/a-stemming-analyzer-for-zends-php-lucene/

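// Returns the original text followed by progressively collapsed variants,
// e.g. "X.Y.Z. Ltd" also yields "XY.Z. Ltd" and "XYZ. Ltd".
function toCanonical($text)
{
    $out = $text . ' ';
    $step = $text;

    // Two capitals separated by whitespace, a dot or a hyphen,
    // followed by any character that is not a lowercase letter.
    $pattern = '/([A-Z])[\s\.-]([A-Z])([^a-z])/';
    while (preg_match($pattern, $step)) {
        // Drop the separator and keep every intermediate form.
        $step = preg_replace($pattern, '$1$2$3', $step);
        $out .= $step . ' ';
    }

    return $out;
}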

// Builds one index document; the canonical form goes into its own searchable field.
function createDocument($id, $companyName)
{
    $canonicalName = toCanonical($companyName);

    $document = new Zend_Search_Lucene_Document();
    $document->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', $id));
    $document->addField(Zend_Search_Lucene_Field::Text('companyname', $companyName));
    // UnStored: indexed for searching, but not stored with the document.
    $document->addField(Zend_Search_Lucene_Field::UnStored('companyname_canonical', $canonicalName));

    return $document;
}

$index->addDocument(createDocument(1, 'XYZ Holdings'));
$index->addDocument(createDocument(2, 'X.Y.Z. (Holding) Company'));
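
As a starting point for the test suite suggested above, here is a minimal sketch that runs toCanonical() against the three company names from the question and checks whether each canonical string contains "XYZ". The canonicalContains() helper is hypothetical and exists only for this illustration. The fourth case is an extra input, not from the question, that shows one limitation: the pattern needs a character after the second capital, so an acronym at the very end of the name is not fully collapsed.

```php
<?php
// Same toCanonical() as in the answer, repeated so the snippet runs on its own.
function toCanonical($text)
{
    $out = $text . ' ';
    $step = $text;

    $pattern = '/([A-Z])[\s\.-]([A-Z])([^a-z])/';
    while (preg_match($pattern, $step)) {
        $step = preg_replace($pattern, '$1$2$3', $step);
        $out .= $step . ' ';
    }

    return $out;
}

// Hypothetical helper: does the canonical form contain the given acronym?
function canonicalContains($companyName, $acronym)
{
    return strpos(toCanonical($companyName), $acronym) !== false;
}

// The first three names come from the question; 'X Y Z' is an added edge case.
$cases = array('XYZ Holdings', 'X.Y.Z. (Holdings) Ltd', 'X Y Z Ltd', 'X Y Z');
foreach ($cases as $name) {
    echo $name . ' => ' . (canonicalContains($name, 'XYZ') ? 'ok' : 'MISSING') . "\n";
}
```

The same function could be applied to the query string before searching companyname_canonical, as mentioned above.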
Louis-Philippe Huberdeau
  • Thanks, it sounds like Lucene doesn't offer what I want then, as I assumed it would do stemming for you; otherwise I'm just reinventing the wheel, and I can guarantee I'll miss something. – pwaring Jan 17 '12 at 09:21
  • The stemming is available as a third-party plugin. However, I don't think what you ask for fits in a normal stemming rule anyway. The Java implementation has a much larger ecosystem to choose from. – Louis-Philippe Huberdeau Jan 17 '12 at 12:56
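
To illustrate what stemming would add for the "holding" / "holdings" case from the question, here is a deliberately naive sketch. It is not the Porter algorithm and not part of Zend_Search_Lucene; naiveStem() is a hypothetical helper that strips only a trailing "s" and a trailing "ing", which is enough to map both words to the same term.

```php
<?php
// Hypothetical toy stemmer, for illustration only (NOT the Porter stemmer).
function naiveStem($token)
{
    $t = strtolower($token);

    // Strip a plural "s", but leave words ending in "ss" alone.
    if (strlen($t) > 3 && substr($t, -1) === 's' && substr($t, -2) !== 'ss') {
        $t = substr($t, 0, -1);
    }

    // Strip "ing" from longer words.
    if (strlen($t) > 5 && substr($t, -3) === 'ing') {
        $t = substr($t, 0, -3);
    }

    return $t;
}

foreach (array('Holdings', 'Holding', 'Ltd') as $word) {
    echo $word . ' => ' . naiveStem($word) . "\n";
}
```

A real stemmer handles many more suffixes and exceptions, so the third-party Porter implementation linked in the answer is the better choice for actual use.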
When you index "XYZ Holdings" (say you are using StandardAnalyzer), there will be two tokens: "xyz" and "holdings".

In the case of "X.Y.Z. (Holdings) Ltd", the tokens will be "x", "y", "z", "holdings" and "ltd".

In the case of "X Y Z Ltd", the tokens will be "x", "y", "z" and "ltd".

When you issue companyname:"X.Y.Z" or companyname:"X Y Z", cases 2 and 3 both match. There's no way Lucene can know that XYZ in case 1 is also an acronym.
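
The token lists above can be reproduced with a rough sketch in plain PHP. It assumes an analyzer that keeps runs of ASCII letters and lowercases them, which only approximates what a real analyzer does; tokenize() is a hypothetical helper, not a Zend_Search_Lucene function.

```php
<?php
// Hypothetical approximation of a letters-only, case-insensitive analyzer.
function tokenize($text)
{
    // Every run of ASCII letters becomes one token; punctuation and spaces split tokens.
    preg_match_all('/[a-zA-Z]+/', $text, $matches);

    return array_map('strtolower', $matches[0]);
}

$names = array(
    1 => 'XYZ Holdings',
    2 => 'X.Y.Z. (Holdings) Ltd',
    3 => 'X Y Z Ltd',
);

foreach ($names as $id => $name) {
    echo $id . ': ' . implode(', ', tokenize($name)) . "\n";
}
```

Only document 1 produces the token "xyz", which is consistent with the single hit reported in the question.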

I think you should write your own tokenizer that generates the same tokens for "XYZ", "X.Y.Z" and "X Y Z", but this might interfere with other uppercase words that aren't acronyms.
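
One way such a tokenizer could behave is sketched below in plain PHP: consecutive single-letter tokens are merged into one token, so all three spellings yield "xyz". The acronymTokens() function is hypothetical, and wiring it into the index would still require a custom analyzer, which is not shown here.

```php
<?php
// Hypothetical tokenizer: merges runs of single-letter tokens into one acronym token.
function acronymTokens($text)
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);

    $tokens = array();
    $run = ''; // single letters collected so far

    foreach ($matches[0] as $word) {
        $word = strtolower($word);

        if (strlen($word) === 1) {
            $run .= $word;
            continue;
        }

        // A longer word ends the current run of single letters.
        if ($run !== '') {
            $tokens[] = $run;
            $run = '';
        }
        $tokens[] = $word;
    }

    // Flush a run that reaches the end of the text.
    if ($run !== '') {
        $tokens[] = $run;
    }

    return $tokens;
}

foreach (array('XYZ Holdings', 'X.Y.Z. (Holdings) Ltd', 'X Y Z Ltd') as $name) {
    echo $name . ' => ' . implode(', ', acronymTokens($name)) . "\n";
}
```

The trade-off mentioned above still applies: initials such as "A B Smith" would be merged into "ab", whether or not they form an acronym.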

naresh
  • "There's no way lucene can know that XYZ in case 1 is also an acronym" I think that is the problem - I expect Lucene to know that three or more uppercase letters followed by a space is likely to be an acronym (like PDF, HTML etc.). I don't really know enough to write my own tokeniser. – pwaring Jan 21 '12 at 21:31
  • You can use StandardAnalyzer (as it removes . in acronyms). I don't know if there's anything equivalent to that in Zend Lucene – naresh Jan 22 '12 at 07:29