2

I have an sql query which fetch the first N rows in a table which is designed as a low-level queue.

select top N * from my_table where status = 0 order by date asc

The intention behind this query is as follows:

  • First, this question is intended to be database agnostic, as my implementation will support sql server, oracle, DB2 and sybase. The sql syntax above of "top N" is just an example.
  • The table can contain millions of rows.
  • N is a relatively small number in comparison, e.g. 100.
  • status is 0 when the row is in the queue. Later it is changed to 1 to indicate that it is in processing. After processing it is deleted. So it is expected that at least 90% of the rows in the table will be with status 0.
  • rows in the table should be fetched according to their date, hence the order by clause.

What is the optimal index to make this query works fastest?

I initially thought the index should be on (date, status), but I am not sure about it anymore. Since the status column will contain mostly zeros, is there an added-value to it? Will it be sufficient to index by (date) alone?
Or maybe it should be (status, date)?

Yoni
  • 10,171
  • 9
  • 55
  • 72

2 Answers2

2

I don't think there is an efficient solution that will be RDMS independent. For example, Oracle has bitmap indexes, SQLServer has partial indexes, and I don't see reasons not to use them if, for instance, Mysql or Sqlite has nothing similar. Also, historically SQLServer implements clustered tables (or IOT in Oracle world) way better than Oracle does, so having clustered index on date column may work perfectly for SQLServer, but not for Oracle.

I'd rather change approach a bit. If you say 90% of rows don't satisfy status=0 condition, why not try refactoring schema, and adding a new table (or materialized view) that holds only records you are interested in ? The number of new programmable objects required for keeping that table up-to-date and merging data with original table is relatively small even if RDMS doesn't support materialized view directly. Also, if it's possible to redesign underlying logic, so rows never updated, only inserted or deleted, then it will help avoiding lock contentions , and as a result , the whole system will have a better performance .

a1ex07
  • 36,826
  • 12
  • 90
  • 103
  • 90% of the rows *will* have status=0. I don't want to refactor the schema, the whole point is to fetch records from the table, change status=1, and then do some processing. After the processing, those records will be deleted. To ensure HA, periodically I'll scan the table for records with status=1 which were not processed (e.g., the server went down after fetching them) – Yoni Jun 11 '12 at 07:25
  • You can try partitioning table by status field. So you will have logically 1 table which stored in 2 partitions. As far as I know, most RDMS support partitioning... – a1ex07 Jun 11 '12 at 13:39
0

Have a clustered index on Date and a non clustered index on Status.

Saurabh R S
  • 3,037
  • 1
  • 34
  • 44
  • 1
    Having a clustered index on Date makes sense for SQL server, but I'd think [at least] twice before making table IOT in Oracle . In addition to possible performance overhead if date column is not unique, it will also require extra column. Quite the contrary with index on Status - Oracle bitmap index is great for that purpose, SQLServer (b-tree) index is not. – a1ex07 Jun 10 '12 at 15:38
  • Can you elaborate and explain why? – Yoni Jun 11 '12 at 07:24
  • As @a1ex07 commented, I suggested keeping SQL Server in mind and I think that would hold true. For more information you can search the difference between clustered and non clustered indexes in sql server. – Saurabh R S Jun 11 '12 at 07:47