We have tabular data (mainly CSV files) coming into the warehouse. Our architecture is based on adding each field/value pair as a single row in a satellite table on Snowflake.
After running all the necessary MERGE INTO queries (which add hash keys, load dates, record sources, and other metadata about the incoming data into several tables, and make sure existing records are not added again), we loop over the fields, which are saved into an array extracted from the file. Each field/value pair is then inserted into the satellite table if its parent key does not exist in the parent hub table.
Here is the loop script:
for (var col_num = 0; col_num < odk_fields.length; col_num++) {
    var COL_NAME = odk_fields[col_num];
    var TABLE_COL_NAME = table_fields[col_num];

    // Link the field to its observation if the pair does not exist yet.
    var obs_field_query = "MERGE INTO LINK_OBSERVATION_FIELD AS LOBSF " +
        "USING (SELECT T.OBSERVATION_DATE, T.CAMPNO, T._SUBMISSION_TIME FROM " + TEMP_TABLE_NAME + " T) ST " +
        "ON (MD5(CONCAT_WS('', ?, DATE(ST.OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', CONCAT(ST.CAMPNO, ST._SUBMISSION_TIME))) = LOBSF.OBSERVATION_DATE_LOCATION_HASH_KEY) " +
        "AND MD5(?) = LOBSF.FIELD_NAME_HASH_KEY " +
        "WHEN NOT MATCHED THEN " +
        "INSERT (FIELD_NAME_OBSERVATION_HASH_KEY, LOAD_DT, RECORD_SRC, OBSERVATION_DATE_LOCATION_HASH_KEY, FIELD_NAME_HASH_KEY) " +
        "VALUES (MD5(CONCAT_WS('', ?, DATE(ST.OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', ST.CAMPNO)), CURRENT_TIMESTAMP(), 'ONA', " +
        "MD5(CONCAT_WS('', DATE(ST.OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', ST.CAMPNO, ST._SUBMISSION_TIME)), MD5(?))";
    var obs_field_query_stmt = snowflake.createStatement({sqlText: obs_field_query, binds: [COL_NAME, COL_NAME, COL_NAME, COL_NAME]});
    var obs_field_rs = obs_field_query_stmt.execute();

    // execute() never returns null, so checking the ResultSet object itself is
    // always true; check that the MERGE produced a result row instead.
    if (obs_field_rs.next()) {
        // Add the field's value to the satellite table.
        var field_value_query = "INSERT INTO SAT_FIELD_VALUE " +
            "SELECT MD5(MD5(CONCAT_WS('', ?, DATE(OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', CAMPNO))), " +
            "CURRENT_TIMESTAMP(), NULL, 'ONA', " + TABLE_COL_NAME + ", MD5(CONCAT_WS('', ?, DATE(OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', CAMPNO)) " +
            "FROM " + TEMP_TABLE_NAME;
        var field_value_query_stmt = snowflake.createStatement({sqlText: field_value_query, binds: [COL_NAME, COL_NAME]});
        var field_value_rs = field_value_query_stmt.execute();
        field_value_rs.next();
        // ResultSet columns are read with getColumnValue(), not bracket indexing.
        sat_field_val_log += field_value_rs.getColumnValue('number of rows inserted');
    }
}
We noticed that all the MERGE INTO commands execute noticeably fast, but once the loop starts, the INSERT INTO queries take a long time to run. For a file with 3 rows of 100 fields each, the load takes about 5 minutes.
Any idea how to optimize the script so the data is uploaded more efficiently?
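One direction I have been wondering about is collapsing the 100 per-column INSERTs into a single set-based statement built once from the field arrays, so there is only one round trip and one table scan. A sketch of how that statement could be assembled (assumptions: the SAT_FIELD_VALUE column order matches my existing INSERT, field names come from my own trusted arrays so they are interpolated rather than bound):

```javascript
// Build ONE INSERT covering every field, instead of one statement per field.
// Each SELECT branch unpivots a single column of the temp table, and
// UNION ALL stitches the branches into a single set-based statement.
function buildBatchedInsert(tempTable, odkFields, tableFields) {
  const branches = odkFields.map((odkField, i) =>
    "SELECT MD5(MD5(CONCAT_WS('', '" + odkField + "', DATE(OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', CAMPNO))), " +
    "CURRENT_TIMESTAMP(), NULL, 'ONA', " + tableFields[i] + ", " +
    "MD5(CONCAT_WS('', '" + odkField + "', DATE(OBSERVATION_DATE, 'DD/MM/YYYY'), 'CAMP', CAMPNO)) " +
    "FROM " + tempTable
  );
  return "INSERT INTO SAT_FIELD_VALUE " + branches.join(" UNION ALL ");
}
```

Would executing one statement like this (instead of looping) be the right way to cut the per-statement overhead, or is there a better pattern?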