I am trying to add a couple of million rows to a table in Exasol. One column is supposed to hold increasing integer values (1 to x million). I can't get my Python script to run in parallel and distribute the load across the different nodes of the cluster, so the process takes days to complete, which is not acceptable.
Since CSV import is parallelized in Exasol, I created a CSV with 250 million rows locally, zipped it, and imported it into the table. This has been the fastest approach so far, taking 7 minutes (bottlenecked by my upload speed).
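For reference, a minimal sketch of the generation step (file name, gzip instead of zip, and the exact column layout are assumptions; the table has one integer column followed by five nullable columns):

import csv
import gzip

# Write 250 million rows: an increasing integer plus five empty (NULL) fields.
# Compressing keeps the file small, since upload bandwidth is the bottleneck.
with gzip.open("big.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f)
    for i in range(1, 250000001):
        writer.writerow([i, "", "", "", "", ""])

The file then goes in with something like IMPORT INTO USER_XXXXX.BIG FROM LOCAL CSV FILE 'big.csv.gz';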
Other approaches I tried:
CREATE SCRIPT BIG1 (bound) AS
    local i = 1
    while i <= bound do
        -- one INSERT statement per row
        query([[insert into USER_XXXXX.BIG values (:n, null, null, null, null, null)]], {n=i})
        i = i + 1
    end
/
EXECUTE SCRIPT BIG1(250000000);
A very naive approach: row-by-row inserts are slow, so this would take forever to complete.
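Even batching the inserts from the client side only removes the per-statement overhead, not the single-session bottleneck. A sketch of what I mean, assuming pyodbc and a hypothetical DSN named exasol:

import pyodbc

# Hypothetical DSN; any Exasol ODBC connection works the same way.
conn = pyodbc.connect("DSN=exasol")
cur = conn.cursor()

BATCH = 100000
for start in range(1, 250000001, BATCH):
    rows = [(i, None, None, None, None, None) for i in range(start, start + BATCH)]
    # executemany sends a whole batch per round trip instead of one row at a time
    cur.executemany(
        "INSERT INTO USER_XXXXX.BIG VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
conn.commit()

This is faster than one statement per row, but everything still funnels through a single connection, so it doesn't spread across the cluster.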
CREATE PYTHON SCALAR SCRIPT USER_XXXXX.BIG2 ("AMOUNT_TO_CREATE" DECIMAL(18,0))
EMITS (val DECIMAL(18,0), col2 BOOLEAN, col3 BOOLEAN, col4 BOOLEAN, col5 BOOLEAN, col6 BOOLEAN) AS

def generate_sql_for_import_spec(import_spec):
    # called by IMPORT ... FROM SCRIPT: rewrites the IMPORT into a SELECT on this UDF
    return "SELECT USER_XXXXX.BIG2(" + import_spec.parameters["AMOUNT_TO_CREATE"] + ")"

def run(ctx):
    for i in range(1, ctx.AMOUNT_TO_CREATE + 1):
        ctx.emit(i, None, None, None, None, None)
/
IMPORT INTO USER_XXXXX.BIG FROM SCRIPT USER_XXXXX.BIG2 WITH AMOUNT_TO_CREATE='250000000';
This works better and runs entirely on the server; it takes about 33 minutes to execute.
I understand that the script can't run in parallel as is. It would be easy to partition the work into 50 million row chunks and run one chunk on each of the 5 nodes. Unfortunately, I have no idea how to connect to other nodes or assign script execution to specific nodes in the cluster.
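Conceptually, I imagine something like the following: a variant of the UDF that takes a range and is called once per chunk, in the hope that Exasol spreads the chunk rows across nodes (BIG2_CHUNK and the chunk boundaries are hypothetical; I don't know whether this actually distributes the work, which is exactly what I'm asking):

CREATE PYTHON SCALAR SCRIPT USER_XXXXX.BIG2_CHUNK ("CHUNK_START" DECIMAL(18,0), "CHUNK_END" DECIMAL(18,0))
EMITS (val DECIMAL(18,0), col2 BOOLEAN, col3 BOOLEAN, col4 BOOLEAN, col5 BOOLEAN, col6 BOOLEAN) AS
def run(ctx):
    # each call emits one chunk of the overall range
    for i in range(ctx.CHUNK_START, ctx.CHUNK_END + 1):
        ctx.emit(i, None, None, None, None, None)
/
INSERT INTO USER_XXXXX.BIG
SELECT USER_XXXXX.BIG2_CHUNK(chunk_start, chunk_end)
FROM (VALUES (1,         50000000),
             (50000001,  100000000),
             (100000001, 150000000),
             (150000001, 200000000),
             (200000001, 250000000)) AS chunks(chunk_start, chunk_end);

But if all five input rows end up on the same node, the chunks would still run serially, so the question remains how to pin the chunks to different nodes.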