I have a requirement to insert a huge amount of data (50 GB of random data) into my database, so that I can use a backup application to check its de-duplication ratio. I have written a small procedure, shown below.

This is taking more than 1 hour, and I don't know how to improve the performance so that I get good throughput for the insert statements. I have set the SGA to 16 GB.

I am a newbie to Oracle and do not know how to set up parallelism to optimize my procedure for good throughput. Please help.

alter session force parallel query parallel 4;
create table table_1(
col1 varchar2(400),
-- 50 columns like this
col50 varchar2(400));

create table table_2(
col1 varchar2(400),
-- 50 columns like this
col50 varchar2(400));

create table table_3(
col1 varchar2(400),
-- 50 columns like this
col50 varchar2(400));

create table table_4(
col1 varchar2(400),
-- 50 columns like this
col50 varchar2(400));

My insert script:

Declare
    rows_inserted number := 0;
Begin
    Loop
        Begin
            INSERT INTO table_1(COL1, ..COL50)
            VALUES(dbms_random.string('L', 400), ..for all 50 values);
            INSERT INTO table_2(COL1, ..COL50)
            VALUES(dbms_random.string('L', 400), ..for all 50 values);
            INSERT INTO table_3(COL1, ..COL50)
            VALUES(dbms_random.string('L', 400), ..for all 50 values);
            INSERT INTO table_4(COL1, ..COL50)
            VALUES(dbms_random.string('L', 400), ..for all 50 values);
            -- Only increment counter when no duplicate exception
            rows_inserted := rows_inserted + 1;
        -- Exception When DUP_VAL_ON_INDEX Then Null;
        End;
        exit when rows_inserted = 10000;
    End loop;
    commit;
End;
/

I have tried this procedure on Oracle 12c, installed on a RHEL 7 VM. The VM has 32 GB of memory, 20 GB of swap, and 16 vCPUs.

It has been running for more than 1 hour and is still going. How do I implement parallelism and optimize the above procedure to get a good throughput rate?


1 Answer

You're doing single-row inserts inside a loop: that's a very slow way of doing things. SQL is a set-based language, and set operations are the most performant way of doing bulk operations. You're also relying on random data to provide duplicates; be in control of it and guarantee the ratios. Besides, how can you get DUP_VAL_ON_INDEX when your tables have no unique keys? (And if they did, you wouldn't be able to insert the duplicates you want for your experiment.)

A better approach is to use bulk SQL:

INSERT INTO table_1(COL1, COL50)
select dbms_random.string('L', 400), dbms_random.string('L', 400)
from dual
connect by level <= 10000
/

INSERT INTO table_1(COL1, COL50)
select *
from table_1
where rownum <= 1000 
/

This will give you 11,000 rows in table_1, 1,000 of which are duplicates. Repeat the second insert to increase the number of duplicates.
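
If you need to hit a data volume rather than a row count, a rough sizing helps: with 50 columns of 400 random single-byte characters, each row is roughly 20 KB, so 50 GB works out to somewhere around 2.7 million rows in total, or about 675,000 per table across your four tables (a ballpark that ignores block and row overhead). A minimal sketch of one table's load as a single direct-path statement - the 675000 is purely illustrative, not a measured figure:

-- Illustrative only: ~675,000 rows at roughly 20 KB each is in the region of
-- 12-13 GB for this one table; adjust the row count to the volume you need.
INSERT /*+ APPEND */ INTO table_1(COL1, /* ...all 50 columns... */ COL50)
select dbms_random.string('L', 400) /* ...repeated for all 50 columns... */, dbms_random.string('L', 400)
from dual
connect by level <= 675000
/
commit
/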

There should be no need for parallelism.

*"All I want now is good throughput, which can insert 50 GB of data within 30 minutes, with or without parallelism."*

However, this new piece of information changes my assessment. The simplest way to run this in parallel is to build separate routines for each table and run each in a separate session.
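
A minimal sketch of that idea, assuming you wrap each table's set-based insert in its own stored procedure (load_table_1 through load_table_4 are hypothetical names, not anything that exists yet): submit one one-off DBMS_SCHEDULER job per procedure, and the four loads run concurrently in separate sessions, subject to your job_queue_processes setting.

-- Illustrative only: one background job per hypothetical loader procedure,
-- so the four tables are populated at the same time in separate sessions.
BEGIN
    FOR i IN 1 .. 4 LOOP
        DBMS_SCHEDULER.CREATE_JOB(
            job_name   => 'LOAD_TABLE_JOB_' || i,
            job_type   => 'PLSQL_BLOCK',
            job_action => 'BEGIN load_table_' || i || '; END;',
            enabled    => TRUE,     -- start the job immediately
            auto_drop  => TRUE);    -- drop the job definition when it finishes
    END LOOP;
END;
/

You can then watch the jobs in USER_SCHEDULER_JOBS and USER_SCHEDULER_JOB_RUN_DETAILS, and check progress by counting rows in each table.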

  • The above procedure works, but I have a requirement of not repeating the data. Even with random data, there is a chance of data being repeated. Is there a way of generating lots of random data in very little time, other than going for tools? – simi Jun 15 '19 at 16:20
  • Sorry, your question was not clear. It sounded like you wanted tables *with* duplicates for some purpose. – APC Jun 15 '19 at 16:34
  • Your question and subsequent comments appear contradictory. Initially you state you are "checking the de-duplication", which implies you require duplicates; later, though, you state you have a "requirement of not repeating data". Beyond that, there's little room for optimization, as the time goes not into writing the data but into the context switches for the 2 million calls to dbms_random. – Belayer Jun 15 '19 at 18:04
  • Sorry, my requirements were not elucidated properly. But what are the ways to generate random data? So far I have found sequences (not good in my case, as they are consecutive), random, and sys_guid(). Why are inserts in Oracle so slow? :( – simi Jun 15 '19 at 19:52
  • How much randomness do you need? You have 50 columns per table: do you want every column of every row to be distinct, or just every row to be different? It would help if you defined a unique key on your tables. Apart from actually enforcing uniqueness, keys are a useful documentation tool, communicating your data model and its rules to other developers using your schema - in this case, us. – APC Jun 15 '19 at 21:17
  • *"Why are the inserts in Oracle so slow?"* When used properly, Oracle inserts are blisteringly fast. What you are doing is extremely peculiar, and in no way a normal use case. Fifty columns of `varchar2(400)` is not a common table projection; in fact it's weird. Filling them all with randomly generated gibberish is also weird. So it's not unreasonable that performance runs slower than you might like. That's why you need to explain more clearly **what you are actually trying to achieve**. Otherwise we can't help you further. – APC Jun 15 '19 at 21:25
  • Okay, I will state my requirement properly this time. The product I am testing is a backup application whose best feature is de-duplication. It dedupes a lot of data and saves a lot of space, which in turn saves customers a lot of money on servers. – simi Jun 16 '19 at 10:16
  • The product is used by customers alongside third-party applications - Oracle, MS SQL and so on - so there is a customer use case I need to simulate. I have installed a RHEL VM with Oracle 12c and written a procedure which does huge inserts (50 GB/100 GB); I then add Oracle to the backup application (the product I am testing) and check the dedupe ratio. – simi Jun 16 '19 at 10:29
  • However, I am new to Oracle and struggling to write a proper procedure which can insert some random data (even if it is not unique for every row/column). All I want now is good throughput, which can insert 50 GB of data within 30 minutes, with or without parallelism. – simi Jun 16 '19 at 10:34
  • I have provided the best solution, which shows the most efficient way of inserting random data; you don't like it, but it seems you are too new to Oracle to articulate what it is you need to ask. I suggest you **edit your question** to specify your requirements, perhaps by giving some input data and required output, or by illustrating what the "backup app" does. Unfortunately, if you can't give a clear question, there's not much we can do to help. – APC Jun 16 '19 at 14:44
  • Yes, I understand that inserting row-wise is not the most intelligent solution and how it can be accelerated by set operations, as you have elucidated. – simi Jun 17 '19 at 07:39
  • The thing is, the above procedure executes very fast, but the data ingested is only in MB. What is the best way to produce 50 GB? – simi Jun 17 '19 at 07:47
  • Even if I take the above procedure, which generates, let's say, 1 GB of data in 5 minutes, is there a way I can execute these procedures in parallel and increase the throughput? I have read about the usage of DBMS_SCHEDULER.RUN_JOB('packagename.', FALSE); – simi Jun 17 '19 at 07:47
  • You're setting up your test to fail. Random data is close to white noise, which de-duplicates/compresses **terribly**. I'd be surprised if you get any significant de-duplication as a result. If you're trying to test the de-duplication features of your backup tool, you should use _production-like data_ (a rough way to fake that is sketched after this thread). – Chris Saxon Jun 17 '19 at 09:04
  • So using a data generator tool would be the most feasible solution? – simi Jun 17 '19 at 09:17
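
Following on from the last few comments: if the point of the exercise is to measure the backup application's de-duplication ratio, completely random strings work against you. One rough way to fake production-like repetition - a sketch only, with hypothetical names and sizes (a seed_strings table, 1,000 seed values, a 675-row multiplier) - is to generate a small pool of random strings once and recycle it, so the same 400-byte values recur many times across the load:

-- Illustrative only: a small pool of 1,000 random 400-character strings...
create table seed_strings as
select dbms_random.string('L', 400) as val
from dual
connect by level <= 1000
/

-- ...recycled across 675,000 rows (1,000 seed values x 675 copies), so every
-- column value is drawn from only 1,000 distinct strings and the backup tool
-- has plenty of repeated chunks to find.
INSERT /*+ APPEND */ INTO table_1(COL1, /* ...all 50 columns... */ COL50)
select s.val /* ...repeated (or shuffled) for all 50 columns... */, s.val
from seed_strings s
     cross join (select level as n from dual connect by level <= 675)
/
commit
/

Whether this looks anything like your customers' real data is a separate question; an export of genuine production data will give a more honest de-dup ratio than anything generated.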