
We have several on-premise databases, and until now our data warehouse has also been on-premise. We are now moving to the cloud and the data warehouse will be in Snowflake. However, we still have more on-premise source systems than cloud ones, so we would like to stick with our on-premise ETL solution. We are using Pentaho Data Integration (PDI) as our ETL tool.

The issue is that the PDI Table output step, which uses the Snowflake JDBC driver, is horribly slow for bulk loads into Snowflake. A year ago it was even worse: it just did an INSERT INTO and a COMMIT after every row. It has improved a lot since then; looking at the Snowflake history/logs, it now seems to do some kind of PUT to a temporary Snowflake stage, but from there it still does some kind of INSERT into the target table, and that is slow (in our test case it took an hour to load 1 000 000 records).

As a workaround for the bulk load, we use SnowSQL (Snowflake's command line tool) scripts, orchestrated by PDI, to get the data into Snowflake. In our example case it then takes less than a minute to load the same 1 000 000 records into Snowflake.

Everything that happens inside the Snowflake database is done via PDI SQL steps sent to Snowflake over JDBC, and all our source system queries run fine with PDI. So the issue is only with the bulk load into Snowflake, where we need this awkward workaround:

Instead of:

PDI.Table input(get source data) >> PDI.Table output(write to Snowflake table)

we now have:

PDI.Table input(get source data) >> PDI.Write to local file >> SnowSQL.PUT local file to Snowflake stage >> SnowSQL.COPY data from Snowflake stage to Snowflake table >> PDI clear local file and clear Snowflake stage.

It works, but is much more complex than it needs to be (compared to previous on premise database load for example).
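
For reference, the SnowSQL part of the workaround looks roughly like the sketch below; the database, schema, stage, table and file names (my_db, my_schema, my_stage, orders, orders_extract.csv) are made-up placeholders, not our actual objects:

    -- Hypothetical SnowSQL script called from the PDI job (object and file names are placeholders).
    -- 1) Upload the local file that PDI wrote to a named internal stage.
    PUT file:///tmp/orders_extract.csv @my_db.my_schema.my_stage AUTO_COMPRESS = TRUE;

    -- 2) Load the staged (gzipped) file into the target table.
    COPY INTO my_db.my_schema.orders
      FROM @my_db.my_schema.my_stage/orders_extract.csv.gz
      FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)
      ON_ERROR = 'ABORT_STATEMENT';

    -- 3) Clean up the stage (the PDI job deletes the local file afterwards).
    REMOVE @my_db.my_schema.my_stage/orders_extract.csv.gz;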

I don't even know whether this issue is on the Snowflake side (the JDBC driver not working optimally) or on the PDI side (PDI not utilizing the JDBC driver correctly), but I would like to get it working better.

user12761950

1 Answer


To bulk load into Snowflake, you need to do the PUT and COPY.
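
A minimal sketch of those two commands, using the table stage (the table and file names are placeholders):

    -- Upload the local file to the table's own stage, then load it;
    -- PURGE removes the staged file after a successful load.
    PUT file:///tmp/data.csv @%target_table;
    COPY INTO target_table
      FROM @%target_table/data.csv.gz
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
      PURGE = TRUE;

PUT compresses the file by default, which is why the COPY references the .gz name.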

Mike Gohl
  • Thanks, Mike. That is what I said: this is how we do it now via SnowSQL. The problem/question is why it doesn't work the same way with the PDI "Table output" step (Bulk load enabled) over Snowflake's JDBC driver. Is it a problem with Snowflake's JDBC driver or rather with how PDI utilizes the driver? – user12761950 Mar 03 '20 at 09:22
  • Most ETL tools have separate bulk load components. The Table output step is not a bulk load component. Pentaho has bulk load components for other databases too, such as Teradata and Oracle. The component you should be using in PDI is the Snowflake plugin. https://help.pentaho.com/Documentation/8.3/Products/Bulk_load_into_Snowflake – Mike Gohl Mar 03 '20 at 14:04
  • 1
    Thanks, Mike. The PDI Snowflake plugin is only good to COPY into Snowflake from AWS S3. As described in the initial post, we have on premise source systems and also then PDI is running on premise. In this setup (and then all cloud starters) adding AWS S3 in the flow would just mean another extra layer of complexity. Instead of Get from source and write to Snowflake, we would have then also S3. This is not a solution in this situation, it would just be a replacement for using SnowSQL, but even more complex as there is another party (AWS) included. – user12761950 Mar 05 '20 at 13:48
  • A year ago, PDI's "Table output" (Bulk load enabled) to Snowflake did an INSERT and COMMIT after each row. Now it does some kind of PUT to a stage and from there starts INSERTs. So it is a little better, but still not OK. Basically I do not understand why it can't just do the PUT to the table stage and COPY into the table from there. Is it an issue in the Snowflake JDBC driver that this cannot be done, or has the PDI side just not implemented it yet? – user12761950 Mar 05 '20 at 13:48
  • IMHO it is PDI's implementation of the driver. – Mike Gohl Mar 05 '20 at 14:26
  • @user12761950 : I can't get PDI to create a temporary stage, due to not having set an active schema. I tried a bunch of things, including changing the DEFAULT_NAMESPACE for the user, but no dice. How did you manage this? – fdrijver Aug 07 '20 at 07:56
  • I do not currently have an active PDI install to test. Did you try fully qualifying the table name? create temp table .? – Mike Gohl Aug 07 '20 at 13:31
  • I am using a Table Output step, so I don't need to create a temp table. I defined the schema in the step, which should be enough, but unfortunately it is not. – fdrijver Aug 07 '20 at 14:55