
I have a table in Oracle 11g with a size of 62 GB and 1.2 billion records. The table has 4 columns and is range partitioned on month with hash subpartitions on transaction number.

Around 80 million records are deleted and re-inserted into this table once every week. This delete & insert process takes ~4 hours when done with a NOAPPEND PARALLEL hint.

Is there any way I could speed up this process? One way I can think of is to replace NOAPPEND with APPEND, but that will lead to space wastage and a drastic increase in table size.

Vivek
  • Why are 80 million rows deleted and re-inserted, and how do you identify which rows they are? – David Aldridge Aug 26 '15 at 15:22
  • It's supposed to be an update statement. Since the update was more costly, we chose delete and re-insert with new data – Vivek Aug 26 '15 at 15:27
  • Hmmm, well update shouldn't be more costly than a delete/insert. Consider the effect on indexes for which the value did not change during the update. – David Aldridge Aug 26 '15 at 15:30
  • This table has no indexes. Partitions and subpartitions help with faster reads. – Vivek Aug 26 '15 at 15:45
  • Is the data changing spread across the table, or isolated to a few partitions? – Craig Aug 26 '15 at 16:31
  • It's spread across the table. – Vivek Aug 26 '15 at 17:07
  • It's odd that an UPDATE is slower than a DELETE and then INSERT. That implies to me that maybe the UPDATE wasn't properly parallelized. You may want to double-check the parallel hint or settings, and also make sure that `alter session enable parallel dml;` was executed in the session. You can check for that by looking at GV$SESSION.PDML_STATUS. – Jon Heller Aug 26 '15 at 18:28

4 Answers

0

APPEND is made exactly for this purpose. The amount of wasted space depends on extent size. Each INSERT creates one extent per parallel process, fills it and creates a new one if needed. So with common settings (an 8 MB extent size in a partitioned table) and about 4 GB inserted per run (62 GB / (1200M / 80M) records), the average waste will be roughly 4 MB * parallel degree, which I would say is decent. INSERT /*+ APPEND PARALLEL */ can be super fast, multi-million rows per second (and gigabytes per second) on decent hardware. It mostly depends on the number of indexes, because their maintenance is the most time-consuming part.
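For illustration, a minimal sketch of such a load, assuming the new rows sit in a staging table; the names txn_table and txn_stage and the degree of 8 are placeholders, not taken from the question:

    -- parallel DML must be enabled in the session, otherwise the insert runs serially
    alter session enable parallel dml;

    -- direct-path, parallel insert into the target table
    insert /*+ APPEND PARALLEL(t, 8) */ into txn_table t
    select /*+ PARALLEL(s, 8) */ *
    from   txn_stage s;

    -- rows loaded via direct path are not visible until the transaction commits
    commit;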

The bigger issue is the DELETE part. You should think about whether and how it can be transformed into a DDL partition operation (CTAS and EXCHANGE PARTITION, etc.).

Husqvik
  • There would be wasted space because the space made available by deletions would not be re-used by the direct path insert, unless some kind of shrink was employed, but that would take more time. – David Aldridge Aug 26 '15 at 15:23
  • True. Does the deletion affect only a few partitions, or many of them? In other words, do you know in advance which partitions will be affected? – Husqvik Aug 26 '15 at 15:36
  • Yes, we know the partitions that will be affected in advance, as that's used in one of the join conditions. But the impact is on almost all partitions, not concentrated on any specific range. – Vivek Aug 26 '15 at 17:09
  • For calculating size wasted, don't forget that each partition or subpartition has its own segment. So the amount of space wasted might be megabytes multiplied by the parallel degree and then multiplied by a large number of subpartitions. – Jon Heller Aug 26 '15 at 18:26
0

Check which one of the two processes - deleting or inserting - is slower and start from there. Data deletion can generate a lot of redo and can be slow, especially if issued in a single thread. Also, update should in theory be faster than delete+insert.

Update speed can be heavily affected by table compression and the pctfree table physical attribute, so check them. It is even more heavily affected by parallelism. Issuing it in parallel might drastically improve performance.

To be able to execute parallel DML operations, in most cases you will need to enable them at the session level by issuing:

alter session enable parallel dml;
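As also suggested in the comments above, you can double-check that the session is really running with parallel DML enabled by looking at PDML_STATUS; a small sanity-check query (it needs SELECT access on V$SESSION):

    -- should return ENABLED for the current session after the ALTER SESSION above
    select pdml_status
    from   v$session
    where  sid = sys_context('userenv', 'sid');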

I'd suggest that you try the following:

  • Check the table's compression and pctfree options
  • Try enabling session parallel DML before running the queries.
  • Try the update statement in parallel - again issue 'alter session...' before the statement
  • Try a MERGE statement if you have any kind of primary key columns; in some cases it can be faster than an update (see the sketch after this list).
  • If the data that needs to be updated fills entire partitions or subpartitions, you can use TRUNCATE instead of DELETE. It is a very fast DDL operation which empties a specific partition or subpartition.
  • If all else fails, you can create a new table with the same definition and load the remaining data from the main table plus the new data into it (again in parallel and with an APPEND hint). After that you can drop the old table and rename the new one. Be sure to manage table grants and other privileges. Shifting 60 GB of data is a cumbersome operation and needs a lot of temporary space, but at least you will avoid the DELETE operation. On reasonable hardware it should take less than 4 hours. For the main bulk of the data you can also issue a CTAS statement:

    create table <new_table_name> parallel <partitioning_info> as select * from <old_table_name> where <filter>;
    

    Beware of library cache locks when issuing long-running CTAS statements.
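A hedged sketch of the MERGE variant from the list above; the table, key and column names (txn_table, txn_stage, txn_month, txn_no, amount, status) are invented for illustration only:

    alter session enable parallel dml;

    -- update matching rows, insert the ones that do not exist yet
    merge /*+ PARALLEL(t, 8) */ into txn_table t
    using txn_stage s
    on    (t.txn_month = s.txn_month and t.txn_no = s.txn_no)
    when matched then
      update set t.amount = s.amount, t.status = s.status
    when not matched then
      insert (txn_month, txn_no, amount, status)
      values (s.txn_month, s.txn_no, s.amount, s.status);

    commit;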

Mihail
  • Hi, can you explain a little bit more about compression and pctfree? We have a similar table on which we have enabled advanced table compression, but so far haven't seen any drop in load time. Are you saying compression slows down the loading process? And what value should pctfree have? – Sherzodbek Apr 21 '22 at 04:48
0

I feel that there are multiple problems in the current solution; you might want to revisit the partitioning strategy as well.

Ideally:

  1. Load the new data into a new table, or it could be an external table (staging).
  2. Join the new data with the existing data (the update!).
  3. Insert the data from step 2 directly as a CTAS with a parallel, direct-path insert; with a little more preparation you can make it a new partitioned table as well.
  4. Once the new insert (CTAS) completes, drop the existing table and rename the new table to the old (dropped) table name.

The only issue with this approach is that it requires space. In my experience, this is the best workaround for large, costly update problems in data warehouses. I have tried and tested this in Oracle RAC/Exadata environments. A rough sketch of steps 2 to 4 is shown below.
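The sketch uses made-up names (txn_table, txn_stage) and columns (txn_month, txn_no, amount, status); the partitioning clause and the join are only indicative of the idea:

    -- steps 2 and 3: build the replacement table in one parallel, direct-path CTAS,
    -- taking changed values from the staging table and keeping the rest as-is
    create table txn_table_new
      nologging
      partition by range (txn_month)
        subpartition by hash (txn_no) subpartitions 16
        (partition p_max values less than (maxvalue))
      parallel 8
    as
    select /*+ PARALLEL(t, 8) PARALLEL(s, 8) */
           t.txn_month, t.txn_no,
           case when s.txn_no is null then t.amount else s.amount end as amount,
           case when s.txn_no is null then t.status else s.status end as status
    from   txn_table t
    left join txn_stage s
      on   (t.txn_month = s.txn_month and t.txn_no = s.txn_no);

    -- step 4: swap the tables; re-create grants, synonyms and statistics afterwards
    drop table txn_table purge;
    alter table txn_table_new rename to txn_table;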

0

When working on such large tables you must already be batching your data to touch only one partition in a single batch. See if there is any way to avoid physical deletes, either by updating the existing records or by logically setting a delete flag against the deleted records and inserting the new ones.

The other way of optimizing the whole load process in this case would be:

  1. Create a separate staging table with the same structure, matching the partition for which your current batch is performing the load.
  2. Load into it the existing data of that partition which is still valid (not the delete-candidate records), and also insert the new data that needs to go into the destination table.
  3. Once the staging table has all the data which needs to go into the existing table, switch the partition from the staging table to the destination table with a partition exchange, in a similar fashion to the one described on this page (sketched below).
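A hedged sketch of the final exchange; all names are placeholders. Note that to swap a whole range partition of the range-hash table from the question, the staging table itself has to be hash partitioned with the same number of partitions as there are subpartitions; otherwise the swap can be done one subpartition at a time with EXCHANGE SUBPARTITION.

    -- swap the staging table's segments with the target partition; this is a
    -- dictionary-only operation, no data is physically moved
    alter table txn_table
      exchange partition p_2015_08
      with table txn_stage
      without validation;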