
I am trying to crawl data from an Oracle database table and want a solution for checkpointing the rows.

  • Basically, the crawler will crawl at most 500 rows. It then stops so the application can process the data. After 2 minutes, the crawler must continue from the last row (the checkpoint).

  • Currently, I am using a solution based on ROWNUM, but the results show that some rows are missing.

This is probably a basic question, but being inexperienced with databases, I just cannot find an acceptable solution.


1 Answer


Normally, you wouldn't design the system this way. Normally, your application would simply open a cursor, fetch 500 rows, process those rows in the application, fetch the next 500 rows, and so on. As long as you don't close the cursor, Oracle will continue to return results as of the system change number (SCN) at which you opened the cursor. Any rows that are added after the cursor is opened won't be returned, and any rows that are deleted after the cursor is opened will still be returned. Since readers do not block writers in Oracle, this approach does not prevent any other application from using the table (which is often an issue in other databases).
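As a minimal sketch of that pattern (assuming a table named table_name, as in the query further down), a single cursor is opened once and then fetched from in batches of 500 until it is exhausted. The same idea applies if the processing happens in the application layer: keep the one cursor open across batches instead of re-running a fresh query for every page.

DECLARE
  CURSOR c IS
    SELECT *
      FROM table_name;                            -- one cursor; results are as of the SCN at OPEN
  TYPE t_rows IS TABLE OF table_name%ROWTYPE;
  l_rows t_rows;
BEGIN
  OPEN c;
  LOOP
    FETCH c BULK COLLECT INTO l_rows LIMIT 500;   -- next batch of at most 500 rows
    EXIT WHEN l_rows.COUNT = 0;                   -- cursor exhausted
    FOR i IN 1 .. l_rows.COUNT LOOP
      NULL;                                       -- process each row here
    END LOOP;
  END LOOP;
  CLOSE c;
END;
/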

If you really want to design a system the way you describe, you would need to have a column (or set of columns) that you can order by consistently. This would have to be something like a monotonically increasing primary key (e.g. a sequence-generated primary key). You could then do something like

SELECT *
  FROM ( SELECT a.*,
                rownum rn
           FROM ( SELECT *
                    FROM table_name
                   ORDER BY some_key ) a      -- consistent, stable ordering key
          WHERE rownum <= MAX_ROW )           -- MAX_ROW: last row of the page
 WHERE rn >= MIN_ROW                          -- MIN_ROW: first row of the page

Of course, with this sort of pagination query, your process will get slower and slower as you get toward the "end" of the table. Getting the first 500 rows will be reasonably efficient since you just have to sort the first 500 rows. When you're fetching rows 9,500 through 10,000, you're sorting 20 times as much data, so that query will likely be an order of magnitude slower. If your table is large, this will generally cause performance to be quite poor. You can (slightly) improve the performance by adding logic that flips the ORDER BY in the inner query once you've fetched half the pages, so that the "later" pages are fetched as if they were early pages of the reversed ordering.
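As a rough sketch of that optimization, assuming the total number of rows (TOTAL_ROWS here, a placeholder you would have to know or estimate up front), the second half of the pages can be fetched with the ordering reversed, so each of those pages sits near the "top" of the sorted result:

SELECT *
  FROM ( SELECT a.*,
                rownum rn
           FROM ( SELECT *
                    FROM table_name
                   ORDER BY some_key DESC ) a         -- reversed ordering
          WHERE rownum <= TOTAL_ROWS - MIN_ROW + 1 )  -- page boundaries counted from the end
 WHERE rn >= TOTAL_ROWS - MAX_ROW + 1
 ORDER BY some_key                                    -- restore the original ascending order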

This approach can also miss rows. If, for example, the transaction that inserts ID 502 commits, then you fetch rows 501-550, and then the transaction that inserts ID 501 commits, you will see row 502 but not row 501.

    +1 of course. Link to Oracle read consistency documentation -> http://docs.oracle.com/cd/E11882_01/server.112/e25494/ds_txnman010.htm – David Aldridge Dec 17 '12 at 07:21
  • Thanks for the detailed solution. I don't have access to the deeper layer of code that handles the connection, and since the table is really big, crawling everything in one go is gonna blow the whole server up. But I will find something based on your solution. – MikeNQ Dec 17 '12 at 09:03
  • @MikeNQ - Why would it "blow the whole server up" to read the contents of the table using a single cursor? That is far, far, far less resource intensive than running a pagination query repeatedly. And, as I explained, in Oracle readers don't block writers or readers so your process won't prevent any other process from doing its work. – Justin Cave Dec 17 '12 at 15:15