I have a use case where I need to read data from one Oracle table in which all fields are of varchar type and save it to another Oracle table with the same fields but the correct data types. This has to be done in Java only. So I want to read a Dataset from the table below:
create table employeeStr (
name varchar2(50),
empid varchar2(50),
age varchar2(50),
salary varchar2(50),
dt_joined varchar2(50));
and write it to the table below:
create table employeeNorm (
name varchar2(50),
empid number,
age number(3,0),
salary number(10,2),
dt_joined date);
My Java code is as follows:
SparkSession sparkSession = SparkSession.builder()
        .master("local[*]")
        .appName("HandlingOracleDataTypes")
        .getOrCreate();
SQLContext sqlContext = sparkSession.sqlContext();

// UDF 1: parse a "yyyyMMdd" string into a java.sql.Date
sqlContext.udf().register("toDate", new UDF1<String, java.sql.Date>() {
    @Override
    public java.sql.Date call(String dateStr) throws Exception {
        Date date = new SimpleDateFormat("yyyyMMdd").parse(dateStr);
        return new java.sql.Date(date.getTime());
    }
}, DataTypes.DateType);

// UDF 2: same parsing, but returning java.util.Date
sqlContext.udf().register("toDate2", new UDF1<String, Date>() {
    @Override
    public Date call(String dateStr) throws Exception {
        Date date = new SimpleDateFormat("yyyyMMdd").parse(dateStr);
        return date;
    }
}, DataTypes.DateType);

// UDF 3: reformat the "yyyyMMdd" string to "dd-MMM-yyyy", keeping it a string
sqlContext.udf().register("toDate3", new UDF1<String, String>() {
    @Override
    public String call(String dateStr) throws Exception {
        Date date = new SimpleDateFormat("yyyyMMdd").parse(dateStr);
        return new SimpleDateFormat("dd-MMM-yyyy").format(date);
    }
}, DataTypes.StringType);

Properties connectionProperties = new Properties();
connectionProperties.put("user", "<username>");
connectionProperties.put("password", "<password>");
String jdbcUrl = "<jdbcurl>";

// Read the all-varchar table and try to cast the columns to the target types
Dataset<Row> employeeStrDS = sparkSession.read().jdbc(jdbcUrl, "employeeStr", connectionProperties);
employeeStrDS.show();
employeeStrDS.printSchema();
employeeStrDS.withColumn("empid", employeeStrDS.col("empid").cast(DataTypes.IntegerType));
employeeStrDS.withColumn("age", employeeStrDS.col("age").cast(DataTypes.IntegerType));
employeeStrDS.withColumn("salary", employeeStrDS.col("salary").cast(DataTypes.FloatType));
//employeeStrDS.withColumn("dt_joined", employeeStrDS.col("dt_joined").cast(DataTypes.DateType));
//employeeStrDS.registerTempTable("abc");
//sqlContext.sql("select toDate(dt_joined) from abc").show();
employeeStrDS.withColumn("dt_joined", functions.callUDF("toDate3", employeeStrDS.col("dt_joined")));
//employeeStrDS.printSchema();

// Write to the table that has the correct data types
employeeStrDS.write().mode(SaveMode.Append).jdbc(jdbcUrl, "employeeNorm", connectionProperties);
If I remove the "dt_joined" column from the table and the code, this code works, but as soon as I bring the "dt_joined" column into the picture nothing works. I tried all three UDFs mentioned in the code, but every time I get the exception below. Please suggest a solution; a sketch of what I suspect I might be doing wrong follows the stack trace.
Caused by: java.sql.BatchUpdateException: ORA-01861: literal does not match format string
at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12296)
at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:246)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLDataException: ORA-01861: literal does not match format string
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1059)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:522)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:587)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:225)
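One thing I'm unsure about is whether the withColumn calls above actually change anything, since Dataset is immutable and withColumn returns a new Dataset rather than modifying the one it is called on. Below is a minimal sketch of what I think the conversion would have to look like, reassigning the withColumn result and reusing the toDate UDF registered above so that dt_joined becomes a real DateType column instead of a string; I have not verified that this gets past the ORA-01861 error:

Dataset<Row> employeeTypedDS = employeeStrDS
        .withColumn("empid", functions.col("empid").cast(DataTypes.IntegerType))
        .withColumn("age", functions.col("age").cast(DataTypes.IntegerType))
        .withColumn("salary", functions.col("salary").cast(DataTypes.FloatType))
        // toDate is the UDF registered earlier; it returns DateType, so Spark
        // should bind a DATE value to Oracle instead of a string Oracle has to parse
        .withColumn("dt_joined", functions.callUDF("toDate", functions.col("dt_joined")));

employeeTypedDS.printSchema();   // expecting empid/age as int, salary as float, dt_joined as date
employeeTypedDS.write().mode(SaveMode.Append).jdbc(jdbcUrl, "employeeNorm", connectionProperties);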
Update: In the actual scenario the Spark code reads data from Impala and creates dataframes, and the Impala table has all columns as String. So the underlying issue is saving a dataframe whose schema is all strings into an Oracle table that has the correct data types.
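To make the Impala scenario concrete, this is roughly the generic conversion I think I need before the JDBC write. The column-to-type map is only an illustration, not my real schema; allStringDS stands for the dataframe read from Impala, and it needs java.util.Map/LinkedHashMap and org.apache.spark.sql.types.DataType in addition to the imports already used above:

// Cast every string column to the type the target Oracle table expects.
Map<String, DataType> targetTypes = new LinkedHashMap<>();
targetTypes.put("empid", DataTypes.IntegerType);
targetTypes.put("age", DataTypes.IntegerType);
targetTypes.put("salary", DataTypes.createDecimalType(10, 2));

Dataset<Row> typedDS = allStringDS;   // all-string dataframe read from Impala
for (Map.Entry<String, DataType> entry : targetTypes.entrySet()) {
    typedDS = typedDS.withColumn(entry.getKey(),
            functions.col(entry.getKey()).cast(entry.getValue()));
}

// As far as I can tell, a plain cast to DateType only understands ISO strings
// like "yyyy-MM-dd", so a "yyyyMMdd" string needs a format-aware conversion,
// e.g. via unix_timestamp(col, pattern):
typedDS = typedDS.withColumn("dt_joined",
        functions.unix_timestamp(functions.col("dt_joined"), "yyyyMMdd")
                .cast(DataTypes.TimestampType)
                .cast(DataTypes.DateType));

typedDS.write().mode(SaveMode.Append).jdbc(jdbcUrl, "employeeNorm", connectionProperties);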