0

I have an program that needs to run queries on a number of very large Oracle tables (the largest with tens of millions of rows). The output of these queries is fed into another process which (as a side effect) can record the progress of the query (i.e., the last row fetched).

It would be nice if, in the event that the task stopped half way through for some reason, it could be restarted. For this to happen, the query has to return rows in a consistent order, so it has to be sorted. The obvious thing to do is to sort on the primary key; however, there is probably going to be a penalty for this in terms of performance (an index access) versus a non-sorted solution. Given that a restart may never happen this is not desirable.

Is there some trick to ensure consistent ordering in another way? Any other suggestions for maintaining performance in this case?

EDIT: I have been looking around and seen "order by rowid" mentioned. Is this useful or even possible?

EDIT2: I am adding some benchmarks:

  • With no order by: 17 seconds.
  • With order by PK: 46 seconds.
  • With order by rowid: 43 seconds.

So any order by has a savage effect on performance, and using rowid makes little difference. Accepted answer is - there is no easy way to do it.

rghome
  • 8,529
  • 8
  • 43
  • 62
  • 1
    Why do you think an index access will be used? If most of the data is going to be retrieved, an index would be pointless. Your only guarentee for rows to appear in a consistent order (eg. primary key ascending) is to use an `ORDER BY` clause. However, you also have to consider things like what if someone inserts new data that appears mid-way in the ordered result set, in between the query starting and restarting? That could throw things off. Sorting your result set will most likely have an impact on performance, though. – Boneist Apr 09 '15 at 15:54
  • You can assume for this case that the underlying tables won't be modified. I assume index access is required to retrieve the rows in order if an order by is used. – rghome Apr 09 '15 at 15:59
  • An index won't necessarily be used to do a sort. And it's not the underlying tables changing that I was talking about, but say you have rows with a pk ids of 1, 3, 5 and 7, so you order your results on the id column. Suppose you stop after the 3rd row and then someone inserts a row with id = 4. When you restart your query, the row with id = 5 is no longer the 3rd in the result set, it's the 4th, so your results aren't consistent over both runs. – Boneist Apr 09 '15 at 16:12
  • @Boneist, how are you interpreting "the underlying tables changing" so that it doesn't include "someone inserts a row"? – Dave Costa Apr 09 '15 at 17:06
  • afaik the only way to sort is using an index or a temp table and with tens of millions of rows, a temp table is not an option. – rghome Apr 09 '15 at 17:29
  • 2
    Is partitioning an option? If the table was hash partitioned you could process one partition at a time (without incurring any cost to retrieve a specific partition), and then restart from the latest partition. You don't really need to sort the data, just putting them in (deterministic) hash buckets should be good enough. – Jon Heller Apr 09 '15 at 17:59
  • Might be an option. Can't create them explicitly for this, but if it happened to be partitioned anyway, then I could use it. Thanks. – rghome Apr 09 '15 at 18:02
  • @rghome Oracle does not require index access just because you use ORDER BY. For example - you could drop that index and ORDER BY will still work. – jva Apr 09 '15 at 21:23
  • @jva - yes, this is my question. Yes - it will still work, of course, but if you use a non-indexed column then it has to either sort in memory or using a temp table (how else can it find the order), which will be very slow. So is there another trick I can use to get a consistent result set ordering using with no or minimimal performance hit? – rghome Apr 10 '15 at 07:33
  • @Dave: I was interpreting that to mean DDL on the table, but yes, I can see that maybe the OP meant DML. – Boneist Apr 10 '15 at 11:40

3 Answers3

3

The best advice I can think of is to reduce the chance of a problem occurring that might stop the process, and that means keeping the code simple. No cursors, no commits, no trying to move part of the data, just straight SQL statements.

Unless a complete restart would be a completely unacceptable disaster, I'd go for simplicity without any part-way restart code at all.

David Aldridge
  • 51,479
  • 8
  • 68
  • 96
  • Up vote for basically saying "you can't" which is always a brave answer! I will accept it if I don't find an alternative. – rghome Apr 09 '15 at 17:27
  • Oh, there's always an alternative of course. It's definitely a trade-off though. I've been in situations where it was absolutely required that a job be "recoverable" because it had to run every 24 hours and took 18 to complete. Mind you, if it had been more simple and not recoverable it might have only taken 12. In 95% of cases I've encountered the best defence was a good offence, and that meant small, fast, non-complex code. – David Aldridge Apr 09 '15 at 19:06
1

If you want some order and queried data is unsorted then you need to sort it anyway, and spend some resources to do sorting.
So, there are at least two variants for optimization:

  1. Minimize resources spent on sorting;
  2. Query already sorted data.

For the first variant Oracle on its own calculates a best variant to minimize data access and overall query time. It may be possible to choose sorting order involved in unique index which already used by optimizer, but it's a very questionable tactic.

Second variant is about index-organized tables and about forcing Oracle with hints to use some specific index. It seems Ok if you need to process nearly all records in some specific table, but if selectivity of query is high it's significantly slows a process, even on a single table.

Think about a table with surrogate primary key which holds data with 10-year transaction history. If you need data only for previous year and you force order by primary key then Oracle need to process records in all 10 years one-by-one to find all records which belongs to a single year.
But if you need data for 9 years from this table then full table scan may be faster than index-based choice.
So selectivity of your query is a key to choose between full table scan and result sorting.

For storing results and restarting query a good solution is to use Oracle Streams Advanced Queuing to fed another process.
All unprocessed messages in queue redirected to Exception Queue where it may be processed separately.
Because you don't specify exact ordering for selected messages I suppose that you need ordering only to maintain unprocessed part of records. If it's true then with AQ you don't need ordering at all and may, even, process records in parallel.

So, finally, from my point of view Buffered Queue is what you really need.

ThinkJet
  • 6,725
  • 24
  • 33
  • Just to clarify the point starting "Think about ...": in fact, I will probably need over 95% of the table, so for an unordered select, an FTS is what it will do and as far as I can see, it has no reason to look at the primary index. But if I add an order by on PK it has to look at that index. So, in fact, a more restrictive select on the PK would be better relatively speaking as it would have to access the index anyway for the WHERE, so nothing lost by using it to sort. But in my case, the index is not normally required, so sorting is pure overhead. – rghome Apr 09 '15 at 17:25
  • Please note, that I mention [index-organized tables](https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables012.htm#ADMIN11703) which didn't add overhead if query results sorted only by primary key. But reorganizing a big table is trouble by itself, so it may be not applicable in your case. – ThinkJet Apr 09 '15 at 17:38
  • Yes - not an available option in my case. Thanks anyway. – rghome Apr 09 '15 at 17:40
0

You could skip ordering and just update the records you processed with something like SET is_processed = 'Y' or SET date_processed = sysdate. Complete restartability and no ordering.

For performance you can partition by is_processed. Yes, partition key changes might be slow, but it is all about trade-offs.

jva
  • 2,797
  • 1
  • 26
  • 41
  • I think that updating would be slower than sorting on the primary key. Plus I would have to do a where condition on the flags. The optimizer would still require a full table scan as there would be no index on these columns and if there was it would have to be updated which would slow it down more, and it probably would not have a high enough cardinality to have an effect anyway. – rghome Apr 10 '15 at 08:06
  • Edited answer to address this a bit. – jva Apr 10 '15 at 08:33