
We have tabular form data (mainly CSV files) coming into the warehouse. Our architecture is based on adding each field/value pair as a single row in a satellite table on Snowflake.

After running all the necessary MERGE INTO queries (which add hash keys, load dates, record sources, and other metadata about the incoming data into several tables, and make sure existing records are not added again), we loop over the fields, which are saved in an array extracted from the file. Each field/value pair is then added to the table if its parent key does not exist in the parent hub table.

Here is the loop script:

for (var col_num = 0; col_num < odk_fields.length; col_num++) {
    var COL_NAME = odk_fields[col_num];
    var TABLE_COL_NAME = table_fields[col_num];

    // Link the field name to the observation; WHEN NOT MATCHED skips keys
    // that already exist in LINK_OBSERVATION_FIELD.
    var obs_field_query = "MERGE INTO LINK_OBSERVATION_FIELD AS LOBSF "+
                "USING (SELECT T.OBSERVATION_DATE, T.CAMPNO, T._SUBMISSION_TIME FROM "+TEMP_TABLE_NAME+" T) ST "+
                "ON (MD5(CONCAT_WS('', ?, DATE(ST.OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', CONCAT(ST.CAMPNO, ST._SUBMISSION_TIME))) = LOBSF.OBSERVATION_DATE_LOCATION_HASH_KEY) "+
                "AND MD5(?) = LOBSF.FIELD_NAME_HASH_KEY "+
                "WHEN NOT MATCHED THEN "+
                "INSERT (FIELD_NAME_OBSERVATION_HASH_KEY, LOAD_DT, RECORD_SRC, OBSERVATION_DATE_LOCATION_HASH_KEY, FIELD_NAME_HASH_KEY) "+
                "VALUES (MD5(CONCAT_WS('', ?, DATE(ST.OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', ST.CAMPNO)), CURRENT_TIMESTAMP(), 'ONA', "+
                "MD5(CONCAT_WS('', DATE(ST.OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', ST.CAMPNO, ST._SUBMISSION_TIME)), MD5(?)) ";
    var obs_field_query_stmt = snowflake.createStatement({sqlText: obs_field_query, binds: [COL_NAME, COL_NAME, COL_NAME, COL_NAME]});
    var obs_field_rs = obs_field_query_stmt.execute();
    obs_field_rs.next();

    // A ResultSet object is always truthy, so the original `if (obs_field_rs)`
    // never skipped the insert; check the MERGE's row count instead.
    if (obs_field_rs.getColumnValue('number of rows inserted') > 0) {

            // One satellite row per field/value for every row in the temp table.
            var field_value_query = "INSERT INTO SAT_FIELD_VALUE "+
                                    "SELECT MD5(MD5(CONCAT_WS('', ?, DATE(OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', CAMPNO))), "+
                                    "CURRENT_TIMESTAMP(), NULL, 'ONA', "+TABLE_COL_NAME+", MD5(CONCAT_WS('', ?, DATE(OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', CAMPNO)) "+
                                    "FROM "+TEMP_TABLE_NAME;
            var field_value_query_stmt = snowflake.createStatement({sqlText: field_value_query, binds: [COL_NAME, COL_NAME]});
            var field_value_rs = field_value_query_stmt.execute();
            field_value_rs.next();
            // Column names containing spaces must be read via getColumnValue();
            // sat_field_val_log is initialised earlier in the procedure.
            sat_field_val_log += field_value_rs.getColumnValue('number of rows inserted');

    }
}

We noticed that all the MERGE INTO commands execute noticeably fast, but inside the loop the INSERT INTO query takes a long time to execute.

For a file with 3 rows of 100 fields each, the load into Snowflake takes about 5 minutes.

Any idea how to optimize the script so it uploads the data more efficiently?

alim1990
  • How long do the merge and insert statements take when you look at the query history within Snowflake? What makes you think it's the insert that is slow and not something else within your JavaScript? (A sketch of such a query-history check follows these comments.) Also, why are you creating an MD5 hash of an MD5 hash here: `SELECT md5(md5(CONCAT_WS...`? – Simon D Feb 24 '21 at 12:07
  • You don't want to insert one row at a time in a loop. I recommend using a counter and a string that adds each new row to a `VALUES` clause. When you get to 1000 rows, you can then run the insert, reset the counter, and reset the insert statement. When you leave the loop, you can run a final insert to process the last ones under the 1000-row threshold (or whatever you set); see the batching sketch after these comments. – Greg Pavlik Feb 24 '21 at 13:22
  • @SimonDarr The SAT_FIELD_VALUE primary key is useless and has never been used, so we figured we could hash the already-hashed FK coming from the hub table. We are considering changing it to the hash of a sequenced value. – alim1990 Feb 25 '21 at 08:50
  • @GregPavlik You mean to put each field and its value into an array, and then run an INSERT INTO of the whole array into the table? – alim1990 Feb 25 '21 at 08:51
  • Apparently each row is taking half a second for each `merge/insert/...`; is it because of the hashing? – alim1990 Feb 25 '21 at 09:13
  • @alim1990 Not an array, a list of values to insert more than one row at a time. Inserting one row at a time will be very slow. If you have a VALUES clause with 1000 rows, it will finish almost as quickly as a single-row insert. – Greg Pavlik Feb 26 '21 at 04:48
  • @GregPavlik OK, so you mean we run the loop over a variable to accumulate all the insert rows, like (1, 'a'), (2, 'b') and so on, and then we concatenate that to the initial query, and that's it? – alim1990 Feb 26 '21 at 05:20
  • @alim1990 Yes, exactly. Keep in mind, though, that Snowflake has a 1 MB size limit for a SQL statement, so if you have a lot of rows you may want to run an insert every 1000 rows or so. – Greg Pavlik Feb 26 '21 at 05:22
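
To check the per-statement timings Simon D asks about, here is a minimal sketch of a query-history check, written in the same Snowflake stored-procedure JavaScript as the loop above; the RESULT_LIMIT value and the ILIKE filter are assumptions:

// Sketch: read recent statement timings from the INFORMATION_SCHEMA.QUERY_HISTORY
// table function. TOTAL_ELAPSED_TIME is reported in milliseconds.
var hist_stmt = snowflake.createStatement({sqlText:
    "SELECT QUERY_TEXT, TOTAL_ELAPSED_TIME " +
    "FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 100)) " +
    "WHERE QUERY_TEXT ILIKE 'INSERT INTO SAT_FIELD_VALUE%' " +
    "ORDER BY START_TIME DESC"});
var hist_rs = hist_stmt.execute();
while (hist_rs.next()) {
    // Inspect each INSERT's elapsed time to confirm where the 5 minutes go.
    var elapsed_ms = hist_rs.getColumnValue('TOTAL_ELAPSED_TIME');
}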
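
And here is a minimal sketch of the batching approach Greg Pavlik describes, assuming the per-row values (`pk_hash`, `field_value`, `fk_hash`) have already been computed in JavaScript inside the loop; those names and the `escapeSql` helper are hypothetical, and the positional VALUES list mirrors the six columns of the SAT_FIELD_VALUE insert above:

var BATCH_SIZE = 1000;   // flush threshold; keeps each statement well under the 1 MB SQL text limit
var batch_rows = [];     // accumulated "(...)" row literals

function escapeSql(s) {
    // Naive single-quote escaping for string literals (hypothetical helper).
    return String(s).replace(/'/g, "''");
}

function flushBatch() {
    if (batch_rows.length === 0) return;
    // One multi-row INSERT instead of one INSERT per field.
    var sql = "INSERT INTO SAT_FIELD_VALUE VALUES " + batch_rows.join(", ");
    snowflake.createStatement({sqlText: sql}).execute();
    batch_rows = [];
}

// Inside the loop over fields, instead of executing an INSERT per field:
batch_rows.push("('" + escapeSql(pk_hash) + "', CURRENT_TIMESTAMP(), NULL, 'ONA', '" +
                escapeSql(field_value) + "', '" + escapeSql(fk_hash) + "')");
if (batch_rows.length >= BATCH_SIZE) flushBatch();

// After the loop, insert whatever is left under the threshold.
flushBatch();

Whether the hashes are computed in JavaScript or left as SQL expressions inside each row literal is a design choice; either way the gain comes from executing one statement per batch of ~1000 rows instead of one statement per row.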
