We have a “cross-reference” module that I’m attempting to build into a Zend Lucene search index and having major performance issues. Tried this ten different ways and either zero/bad results, or super accurate and extremely slow performance. The current solution described below results in accurate results, but painfully slow on our fairly large data set.
Essentially, let us say we have ACME Part #ABC123 and a couple of other part numbers. Manufactured by ACME, and distributed by WalMart, Costco, and Target. Each of these three stores has their OWN part number that they apply to the product on their website. The following table is a pretty clear example:
So, what we want to see is when a user searches for “Target 1009936”, would return both CASE and BOX sizes of ACME Part #ABC123. When I search for “WalMart WM0002353CS”, would return CASE size for ACME Part #DEF567. And so on.
The fact that some wholesaler/distributors work the unit of measure into their numbering system, and some to not, is irrelevant. What is relevant is getting Lucene to index and retrieve the data quickly, which is not currently the case for our database of over two million product SKUs.
In order to achieve this what we have done is to add an indexed field for each wholesaler, where cross-reference data for that item exists (let’s just call the fields crosstar, crosswal, and crosscostco. So, in this example we have cross-reference data for all, but in reality we only have for some (but a LOT… millions upon millions of cross-references). So, in this case, were I to save a log on the $hit->crosstar/crosswal/crosscostco values into a log file, we would see:
=>10001123
crosstar: 1009936
crosswal: WM0001256CS
crosscostco: CC5991243CA
crosstarfull: Target 1009936
crosswalfull: WalMart WM0001256CS
crosscostcofull: CostCo CC5991243CA
We are indexing both the part, and separately, wholesaler’s name + part (a lot of people in our business actually search this way - I know it sounds crazy), which is why there are six fields beneath the sku / primary key, not three.
Problem is, the way the data is being tokenized and indexed, because the wholesaler names are causing too much similarity between each of the indexed values, it’s causing Lucene to work a lot harder and longer to locate the record. And when a user types in simply “Target” or “WalMart” or “Costco”, it’s like brain overload, man. Poof.
I am a mediocre developer at best, and the fact that I cannot figure this out is pretty much proof of that. Does anyone have any recommendations on how this scenario could be reworked to produce the desired results?