I'm fighting a case of customer stupidity / stubbornness here. We have an application to look up retail shoppers by various criteria. The most common combination we see is a (partial) last name plus a (partial) postal code.

When they enter the full postal code, it works remarkably well. The problem is they sometimes choose to enter, effectively, a postal code like '3%'.

Any miracle out there to overcome our customer stupidity?

ETA: There are two tables involved in this particular dog of an operation: customers and addresses. I'm a DBA involved in supporting this application, rather than on the development side. I have no ability to change the code (though I can pass on suggestions in that vein) but I have some leeway on improving indexing.

Customers has 22 million rows; addresses has 23 million.

"Stupidity" may be a harsh word, but I don't understand why you would ever try to look up a customer by postal code like '3%'. I mean, how much effort is it to type in their full zip or postal code?

stephan
  • When the customer is stupid, what is the impact? Application breaks? Poor performance? Something else? – BobC Jan 18 '17 at 20:33
  • Depending on your faith, you may either pray, or you may take the view that miracles don't exist. Also, just what you have presented so far does not prove that your customer is stupid beyond reasonable doubt; it would be better to leave that out. Now: if you have the proper indexes in place, make sure statistics are up to date. Also, the distribution of postal codes may be skewed; you may need to use histograms (see the sketch after these comments). – mathguy Jan 18 '17 at 20:34
  • By the way, looking again at your title: if you have an index on postal code, the optimizer can use it in queries with conditions such as `like '3%'` - as long as the wildcard is at the end. It wouldn't be able to use it for a condition such as `like '%3'` - but that is **not** your case. So the problem is not the index; the problem may be the stats and possibly a wildly non-uniform distribution of postal codes. – mathguy Jan 18 '17 at 21:12
  • Possible duplicate of [Oracle query using 'like' on indexed number column, poor performance](http://stackoverflow.com/questions/1676064/oracle-query-using-like-on-indexed-number-column-poor-performance) – Matthew Strawbridge Jan 18 '17 at 21:41
  • @BobC: Apparently the web tier has a 20 second timeout, and sometimes when the criteria are not selective enough, the application times out rather than returning anything. – stephan Jan 19 '17 at 22:32
  • @mathguy: I edited to add some more information. You won't convince me that trying to do a customer lookup by postal code like '3%' is not stupid. That still leaves 3 million rows to be cross-referenced against the customers table... Personally I don't think there's any miracle we can work without changing the code (don't allow single-character searches there) but hoping for something as a tide-me-over. – stephan Jan 19 '17 at 22:37
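
As an aside, here is a minimal sketch of the statistics/histogram gathering the comments above suggest. It is illustrative only: the table name ADDRESSES is assumed from the question, and SIZE 254 is just a common maximum-bucket choice.

-- Hypothetical: gather stats on the addresses table with a histogram
-- on postal_code, so the optimizer can see the skew in its distribution.
begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'ADDRESSES',
    method_opt => 'FOR COLUMNS postal_code SIZE 254'
  );
end;
/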

1 Answer


A difficulty is that

WHERE postal_code LIKE '3%'
AND   last_name LIKE 'MC%'

can usually benefit from only one of these columns' indexes: either an index on postal_code or an index on last_name. A composite index on both is no help beyond its leading column.
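
To see why, consider a hypothetical composite index (the name is illustrative):

create index retail_records_pc_ln   -- hypothetical, for illustration only
  on retail_records (postal_code, last_name);

Oracle can range-scan this index on postal_code LIKE '3%', but within that range the last_name values are not grouped together, so last_name LIKE 'MC%' can only be applied as a filter to every entry in the range.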

Consider this as a possible solution (assuming your table name is RETAIL_RECORDS):

-- Virtual columns exposing just the first character of each search column
alter table retail_records 
  add postal_code_first_1 VARCHAR2(1) 
       GENERATED ALWAYS AS ( substr(postal_code, 1, 1) ) VIRTUAL;

alter table retail_records 
  add last_name_first_1 VARCHAR2(1) 
       GENERATED ALWAYS AS ( substr(last_name, 1, 1) ) VIRTUAL;

create index retail_records_n1 
  on retail_records ( postal_code_first_1, last_name_first_1, postal_code );

create index retail_records_n2 
  on retail_records ( postal_code_first_1, last_name_first_1, last_name );

Then, in situations where postal_code and/or last_name conditions are given to you, also include a condition on the appropriate ..._first_1 column.

So,

WHERE postal_code LIKE :p1
AND   last_name LIKE :p2
AND   postal_code_first_1 = SUBSTR(:p1, 1, 1)
AND   last_name_first_1   = SUBSTR(:p2, 1, 1)
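
For example, a complete lookup might read as follows (CUSTOMER_ID is a hypothetical column; :p1 and :p2 hold the user's patterns, e.g. '3%' and 'MC%'):

-- The two SUBSTR predicates are logically redundant with the LIKEs, but
-- they give the optimizer equality conditions on the leading columns of
-- the indexes above, making either index usable for a range scan.
SELECT customer_id, last_name, postal_code
FROM   retail_records
WHERE  postal_code LIKE :p1
AND    last_name LIKE :p2
AND    postal_code_first_1 = SUBSTR(:p1, 1, 1)
AND    last_name_first_1   = SUBSTR(:p2, 1, 1);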

That's going to allow Oracle to, on average, search through about 1/260th of the data (1/10th for the first digit of the postal code times 1/26th for the first letter of the last name). OK, there are a lot more last names starting with "M" than with "Z", so that's a little generous. But even for a high-frequency combination (say, postal_code like '1%' and last_name like 'M%'), it still shouldn't have to look through more than about 1% of the rows.

I expect that you'll have to tweak this once you see what Oracle's Cost-Based Optimizer is actually doing, but I think the basic principle is sound.
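
One quick way to check what the CBO chose (standard EXPLAIN PLAN / DBMS_XPLAN; the literals here just stand in for your bind values):

-- Show the plan Oracle picks for the prefixed lookup
EXPLAIN PLAN FOR
SELECT last_name, postal_code
FROM   retail_records
WHERE  postal_code LIKE '3%'
AND    last_name LIKE 'MC%'
AND    postal_code_first_1 = '3'
AND    last_name_first_1   = 'M';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);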

Matthew McPeak
  • I obviously did not ask the question clearly enough. There are two tables involved, customers and addresses. The problem is a search on postal code like '3%' will return 3 million rows from addresses that need to be joined against some subset of customers... all subject to the whim of the end user. – stephan Jan 19 '17 at 22:43
  • Ugh. That changes everything. I'd look at denormalizing some data into one of the tables. Otherwise, a `FAST REFRESH ON COMMIT` materialized view of the join between the tables would be a perilous but plausible approach (sketched below). Really good article on that here: https://orastory.wordpress.com/2014/11/27/the-mess-that-is-fast-refresh-join-only-materialized-views/. – Matthew McPeak Jan 20 '17 at 00:08
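
A minimal sketch of the join materialized view that last comment describes, under assumed names (CUSTOMERS and ADDRESSES joined on a hypothetical customer_id column). Join-only fast-refresh MVs need materialized view logs WITH ROWID on every base table, and the MV must select each table's rowid:

-- Hypothetical sketch; the join column is an assumption, not the real schema
create materialized view log on customers with rowid;
create materialized view log on addresses with rowid;

create materialized view customer_address_mv
  refresh fast on commit
as
select c.rowid AS c_rowid,    -- rowids of all base tables are required
       a.rowid AS a_rowid,    -- for a fast-refreshable join-only MV
       c.customer_id,
       c.last_name,
       a.postal_code
from   customers c,
       addresses a
where  a.customer_id = c.customer_id;

The MV is a real table, so it could then carry the same one-character virtual columns and indexes described in the answer.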