
I'm attempting to bulk insert into a SQL Server table from a Databricks notebook using a method similar to this:

Bulk copy to Azure SQL Database or SQL Server

This works fine until I attempt to write to a column of datatype datetime. The table I'm attempting to write to has this schema:

create table raw.HubDrg_TEST
(
  DrgKey varchar(64) not null,
  LoadDate datetime,
  LoadProcess varchar(255),
  RecordSource varchar(255),
  DrgCode varchar(255)
 )

My Scala code is as follows:

//Imports needed by the code below (Config and the bulkCopyToSqlDB implicit
//come from the azure-sqldb-spark library)
import java.math.BigInteger
import java.security.MessageDigest
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, udf}
import com.microsoft.azure.sqldb.spark.bulkcopy._
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

//Get dataset for data in staging table
var stagedData: DataFrame = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("tempDir", tempDir)
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("query", "select distinct CodeID as DrgCode, getdate() as LoadDate from StageMeditech.livendb_dbo_DAbsDrgs")
  .load()

//Get dataset for data in existing Hub
val existingHub: DataFrame = spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("tempDir", tempDir)
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("query", "Select DrgKey as ExistingDrgKey from raw.HubDrg_TEST")
  .load()

//UDF: SHA-256 hash of a string, rendered as hex (%064x pads to the digest's
//full 64 hex characters; %032x could yield shorter strings for digests with
//leading zero bytes)
val sha_256 = udf((s: String) =>
  String.format("%064x", new BigInteger(1, MessageDigest.getInstance("SHA-256").digest(s.getBytes("UTF-8"))))
)

//Add additional columns
stagedData = stagedData
  .withColumn("DrgKey", sha_256(col("DrgCode")))
  .withColumn("LoadProcess", lit("TestLoadProcess"))
  .withColumn("RecordSource", lit("TestRecordSource"))

//Join and filter out existing hub records
val dff = stagedData
  .join(existingHub, col("DrgKey") === col("ExistingDrgKey"), "left_outer")
  .filter(existingHub.col("ExistingDrgKey").isNull)
  .drop("ExistingDrgKey")

//Bulk insert
val bulkCopyConfig = Config(Map(
  "url" -> dwServer,
  "databaseName" -> dwDatabase,
  "user" -> dwUser,
  "password" -> dwPass,
  "dbTable" -> "raw.HubDrg_TEST",
  "bulkCopyBatchSize" -> "2000",
  "bulkCopyTableLock" -> "false",
  "bulkCopyTimeout" -> "0"
))

dff.bulkCopyToSqlDB(bulkCopyConfig) 

The problem I'm seeing is that the datetime value I'm selecting as `getdate() as LoadDate` produces this error when inserting into the table above (note that colid 2 in the target table is LoadDate, the datetime column):

    SqlNativeBufferBufferBulkCopy.WriteTdsDataToServer, error in OdbcDone: SqlState: 42000, NativeError: 4816, 'Error calling: bcp_done(this->GetHdbc()) | SQL Error Info: SrvrMsgState: 1, SrvrSeverity: 16, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column type from bcp client for colid 2. | Error calling: pConn->Done() | state: FFFF, number: 58673, active connections: 9', Connection String: Driver={pdwodbc17e};app=TypeC01-DmsNativeWriter:DB66\mpdwsvc (13056)-ODBC;trusted_connection=yes;autotranslate=no;server=\\.\pipe\DB.66-a313018f1e5b\sql\query;database=Distribution_15

It still fails even when I avoid the SQL Server datetime value entirely and instead set LoadDate with Spark's builtin `current_timestamp()` function.
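
For reference, this is the exact change I tried; it fails with the same bcp error:

    import org.apache.spark.sql.functions.current_timestamp

    //Build LoadDate in Spark instead of selecting getdate() from SQL Server
    stagedData = stagedData.withColumn("LoadDate", current_timestamp())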

I saw this Stack Overflow question, which asks something similar, but it still did not answer my question. Does anyone have a good example of how to insert into a SQL Server table with a datetime column using the com.microsoft.azure.sqldb.spark.bulkcopy._ library?

Here is a sample of the data from `dff.show()`:

    +-------+--------------------+--------------------+---------------+----------------+
    |DrgCode|            LoadDate|              DrgKey|    LoadProcess|    RecordSource|
    +-------+--------------------+--------------------+---------------+----------------+
    |    390|2019-07-02 09:05:...|48a1a756f2d83f1dc...|TestLoadProcess|TestRecordSource|
    |     18|2019-07-02 09:05:...|4ec9599fc203d176a...|TestLoadProcess|TestRecordSource|
    |    481|2019-07-02 09:05:...|51d089cdaf0c968c9...|TestLoadProcess|TestRecordSource|
    |    460|2019-07-02 09:05:...|841a05fd378a2c067...|TestLoadProcess|TestRecordSource|
    |    838|2019-07-02 09:05:...|cef5838d118dccd9d...|TestLoadProcess|TestRecordSource|
    |     61|2019-07-02 09:05:...|d029fa3a95e174a19...|TestLoadProcess|TestRecordSource|
    |    807|2019-07-02 09:05:...|fce86e339dc3131c4...|TestLoadProcess|TestRecordSource|
    |     44|2019-07-02 09:05:...|71ee45a3c0db9a986...|TestLoadProcess|TestRecordSource|
    |    267|2019-07-02 09:05:...|8acc23987b8960d83...|TestLoadProcess|TestRecordSource|
    |    222|2019-07-02 09:05:...|9b871512327c09ce9...|TestLoadProcess|TestRecordSource|
    |    934|2019-07-02 09:05:...|a8443b1426652157e...|TestLoadProcess|TestRecordSource|
    |    677|2019-07-02 09:05:...|2782526eaa0c5c254...|TestLoadProcess|TestRecordSource|
    |    701|2019-07-02 09:05:...|290a0b92873bdf4e4...|TestLoadProcess|TestRecordSource|
    |    441|2019-07-02 09:05:...|2dfe70c43208f52b9...|TestLoadProcess|TestRecordSource|
    |    439|2019-07-02 09:05:...|50a010ce24d089605...|TestLoadProcess|TestRecordSource|
    |    883|2019-07-02 09:05:...|3055e0d8130c7a197...|TestLoadProcess|TestRecordSource|
    |    947|2019-07-02 09:05:...|4d0198f4905a08812...|TestLoadProcess|TestRecordSource|
    |    369|2019-07-02 09:05:...|5f193b350c8aba488...|TestLoadProcess|TestRecordSource|
    |     21|2019-07-02 09:05:...|6f4b6612125fb3a0d...|TestLoadProcess|TestRecordSource|
    |    503|2019-07-02 09:05:...|7182dd431b5c8833e...|TestLoadProcess|TestRecordSource|
    +-------+--------------------+--------------------+---------------+----------------+
    only showing top 20 rows

And the schema of `dff` (note that LoadDate is a Spark timestamp):

    dff:org.apache.spark.sql.DataFrame
    DrgCode:string
    LoadDate:timestamp
    DrgKey:string
    LoadProcess:string
    RecordSource:string
  • Was the error you got any different when you used `withColumn("LoadDate",current_timestamp())`? – uh_big_mike_boi Jul 02 '19 at 08:14
  • @big_mike_boiii Just ran it with that code and: SqlNativeBufferBufferBulkCopy.WriteTdsDataToServer, error in OdbcDone: SqlState: 42000, NativeError: 4816, 'Error calling: bcp_done(this->GetHdbc()) | SQL Error Info: SrvrMsgState: 1, SrvrSeverity: 16, Error <1>: ErrorMsg: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column type from bcp client for colid 2. | Error calling: pConn->Done() | state: FFFF, number: 91858, active connections: 5', ... – jymbo Jul 02 '19 at 08:25
  • what happens when you force the `withColumn("LoadDate",current_date())` – uh_big_mike_boi Jul 02 '19 at 08:49
  • Same error as before :( – jymbo Jul 02 '19 at 09:01
  • I wonder if you forced it in to a java.sql.Date if it would handle it – uh_big_mike_boi Jul 02 '19 at 09:06
  • Even when changing the code to withColumn("LoadDate",to_timestamp(col("LoadDate"),"yyyy-MM-dd'T'HH:mm:ss").cast(DateType)) to make it a Date type, it still fails – jymbo Jul 02 '19 at 19:50
  • I think it's a bug in the bulk loader... this code using standard JDBC allows the datetime (timestamp) value to be inserted into that table with no problem (see the reformatted snippet after these comments): dff.write .format("com.databricks.spark.sqldw") .option("url", sqlDwUrlSmall) .option("forwardSparkAzureStorageCredentials", "true") .option("dbTable", "raw.HubDrg_TEST") .option("tempDir", tempDir) .mode("append") .save() – jymbo Jul 03 '19 at 00:51
  • I'm not sure, but it may have something to do with the current limitations of the bulk insert API, i.e. some data types are not supported. https://learn.microsoft.com/en-us/sql/connect/jdbc/use-bulk-copy-api-batch-insert-operation?view=sql-server-ver15#known-limitations – Dimitri B Feb 04 '20 at 05:30
  • @DimitriB You are correct. According to the link you posted "Due to the limitations of Bulk Copy API, MONEY, SMALLMONEY, DATE, DATETIME, DATETIMEOFFSET, SMALLDATETIME, TIME, GEOMETRY, and GEOGRAPHY data types, are currently not supported for this feature." – jymbo Feb 10 '20 at 00:08
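
For readability, here is the JDBC-based workaround from jymbo's comment above, reformatted. Per the known-limitations link, the Bulk Copy API does not support the DATETIME type, which is why this standard write path succeeds where bulkCopyToSqlDB fails:

    //Workaround from the comments: write through the sqldw connector instead
    //of the bulk copy API, which rejects DATETIME columns
    dff.write
      .format("com.databricks.spark.sqldw")
      .option("url", sqlDwUrlSmall)
      .option("forwardSparkAzureStorageCredentials", "true")
      .option("dbTable", "raw.HubDrg_TEST")
      .option("tempDir", tempDir)
      .mode("append")
      .save()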
