I am trying to add a couple of million rows to a table in Exasol. One column is supposed to hold increasing integer values (1 to x million). I can't get my Python script to run in parallel and distribute the load across the different nodes of the cluster, so the process takes days to complete, which is not acceptable.
Since CSV import is parallelized in Exasol, I created a CSV with 250 million rows locally, zipped it, and imported it into the table. This has been the fastest approach so far, taking 7 minutes (bottlenecked by my upload speed).
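For reference, a minimal sketch of the generation step (file name, gzip instead of zip, and the exact column layout are assumptions; the table has one integer column followed by five nullable columns):

import csv
import gzip

# Write 250 million rows: an increasing integer plus five empty (NULL) fields.
# Compressing keeps the file small, since upload bandwidth is the bottleneck.
with gzip.open("big.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f)
    for i in range(1, 250000001):
        writer.writerow([i, "", "", "", "", ""])

The file then goes in with something like IMPORT INTO USER_XXXXX.BIG FROM LOCAL CSV FILE 'big.csv.gz';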
Other approaches I tried:
CREATE SCRIPT BIG1 (bound) AS
    local i = 1
    while i <= bound do
        -- one INSERT statement per row
        query([[insert into USER_XXXXX.BIG values (:n, null, null, null, null, null)]], {n=i})
        i = i + 1
    end
/
EXECUTE SCRIPT BIG1(250000000);
A very naive approach: row-by-row inserts are slow, so this would take forever to complete.
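Even batching the inserts from the client side only removes the per-statement overhead, not the single-session bottleneck. A sketch of what I mean, assuming pyodbc and a hypothetical DSN named exasol:

import pyodbc

# Hypothetical DSN; any Exasol ODBC connection works the same way.
conn = pyodbc.connect("DSN=exasol")
cur = conn.cursor()

BATCH = 100000
for start in range(1, 250000001, BATCH):
    rows = [(i, None, None, None, None, None) for i in range(start, start + BATCH)]
    # executemany sends a whole batch per round trip instead of one row at a time
    cur.executemany(
        "INSERT INTO USER_XXXXX.BIG VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
conn.commit()

This is faster than one statement per row, but everything still funnels through a single connection, so it doesn't spread across the cluster.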
CREATE PYTHON SCALAR SCRIPT USER_XXXXX.BIG2 ("AMOUNT_TO_CREATE" DECIMAL(18,0))
EMITS (val DECIMAL(18,0), col2 BOOLEAN, col3 BOOLEAN, col4 BOOLEAN, col5 BOOLEAN, col6 BOOLEAN) AS

def generate_sql_for_import_spec(import_spec):
    # called by IMPORT ... FROM SCRIPT: rewrites the IMPORT into a SELECT on this UDF
    return "SELECT USER_XXXXX.BIG2(" + import_spec.parameters["AMOUNT_TO_CREATE"] + ")"

def run(ctx):
    for i in range(1, ctx.AMOUNT_TO_CREATE + 1):
        ctx.emit(i, None, None, None, None, None)
/
IMPORT INTO USER_XXXXX.BIG FROM SCRIPT USER_XXXXX.BIG2 WITH AMOUNT_TO_CREATE='250000000';
This works better and runs entirely on the server; it takes about 33 minutes to execute.
I understand that the script can't run in parallel as is. It would be easy to partition the work into 50 million row chunks and run one chunk on each of the 5 nodes. Unfortunately, I have no idea how to connect to other nodes or assign script execution to specific nodes in the cluster.
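Conceptually, I imagine something like the following: a variant of the UDF that takes a range and is called once per chunk, in the hope that Exasol spreads the chunk rows across nodes (BIG2_CHUNK and the chunk boundaries are hypothetical; I don't know whether this actually distributes the work, which is exactly what I'm asking):

CREATE PYTHON SCALAR SCRIPT USER_XXXXX.BIG2_CHUNK ("CHUNK_START" DECIMAL(18,0), "CHUNK_END" DECIMAL(18,0))
EMITS (val DECIMAL(18,0), col2 BOOLEAN, col3 BOOLEAN, col4 BOOLEAN, col5 BOOLEAN, col6 BOOLEAN) AS
def run(ctx):
    # each call emits one chunk of the overall range
    for i in range(ctx.CHUNK_START, ctx.CHUNK_END + 1):
        ctx.emit(i, None, None, None, None, None)
/
INSERT INTO USER_XXXXX.BIG
SELECT USER_XXXXX.BIG2_CHUNK(chunk_start, chunk_end)
FROM (VALUES (1,         50000000),
             (50000001,  100000000),
             (100000001, 150000000),
             (150000001, 200000000),
             (200000001, 250000000)) AS chunks(chunk_start, chunk_end);

But if all five input rows end up on the same node, the chunks would still run serially, so the question remains how to pin the chunks to different nodes.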