
I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.

K  Z
=  ========================================== 
1  0.6508784068583483336644518457703156855132
2  0.4078768075307567089075462518978907890789
3  0.5365440453204830852096396398565048002638
4  0.7573281573257782352853823856682368153782

What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.

I have been led to believe by several sources (most notably this paper from some folks at Florida State University) that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.

select * from (
  select *
  from T
  where Z >= [origin's Z value]
  order by Z asc
) where rownum <= 25;

In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
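For completeness, here is a sketch of the hinted form I have been trying, along with the "previous" variation. The index name T_Z_IDX is hypothetical; substitute whatever the index on Z is actually called. The hint has to go in the inner query block, where the table reference lives.

    -- "Next" 25 rows; T_Z_IDX is a placeholder for the real index name on Z.
    select * from (
      select /*+ INDEX_ASC(t t_z_idx) */ *
      from T t
      where t.Z >= :origin_z
      order by t.Z asc
    ) where rownum <= 25;

    -- "Previous" 25 rows: flip the comparison, the sort order, and the hint.
    select * from (
      select /*+ INDEX_DESC(t t_z_idx) */ *
      from T t
      where t.Z <= :origin_z
      order by t.Z desc
    ) where rownum <= 25;

Even in this form, the plan still shows a full table scan for me.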

Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).

I'll add in case it's important to my query that the database I'm using is running Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.

APC
Eric
  • To be clear. If your table has 200 rows `1 .. 200` and you want the 25 closest to row `100` you will get rows `88 .. 112` ? Same case. What about the 25 closest to row `5` will be rows `1 .. 25` ? – Juan Carlos Oropeza Dec 01 '15 at 21:10
  • 2
    Not sure its better speed wise but try `...WHERE something ORDER BY k ROWS BETWEEN 12 PRECEDING AND 13 FOLLOWING` if I understand it correctly – Mihai Dec 01 '15 at 21:10
  • @JuanCarlosOropeza Sorry, I wasn't crystal clear about edge cases or the actual range of interest. What I actually need is the 25 closest records in either direction. At the beginning and end of the table, I'm not too concerned about a shortage in one direction or the other. The number 25 isn't set in stone either, but was provided mainly for flavor. I figure a solution that solves the problem will apply for varying ranges. – Eric Dec 01 '15 at 21:33
  • @Mihai I had not heard of the BETWEEN ... PRECEDING AND ... FOLLOWING syntax, but I will definitely give this a try. Does it work for an order different from the order of the rows on the disk? – Eric Dec 01 '15 at 21:35
  • you can order by anything you want – Mihai Dec 01 '15 at 21:41
  • @Eric: a big part of the problem is that you're asking for an unbounded set. You want "the twenty-five values closest to n" but databases don't work efficiently on problems like that. They're *real* good at "give me the record with key = n". They're semi-OK at "give me rows with keys between n-25 to n+25". But they're terrible at "here's n. Find me the twenty-five closest values to n, out of a couple hundred million choices". I wish I had a better suggestion for you. Best of luck. – Bob Jarvis - Слава Україні Dec 01 '15 at 23:21

1 Answer


I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)

I noticed that you are using SELECT *, which means the query returns both columns. The actual table rows must therefore be accessed, since neither index includes both columns. This may push the optimizer towards a full table scan for larger samples. If the query could be satisfied from the index alone, the optimizer would be more likely to use it.

One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
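Concretely, the change I tested looks like this (using a bind variable in place of the origin's Z value):

    -- Both SELECT lists project only Z, so the index on Z alone
    -- can satisfy the query without touching the table.
    select z from (
      select z
      from T
      where z >= :origin_z
      order by z asc
    ) where rownum <= 25;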

If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.
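A minimal sketch of that approach (the index name here is made up; use whatever fits your naming convention):

    -- Composite index covering both projected columns.
    create index t_z_k_idx on T (z, k);

    -- Now both columns come from the index; no table access needed.
    select z, k from (
      select z, k
      from T
      where z >= :origin_z
      order by z asc
    ) where rownum <= 25;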

Dave Costa