0

I have a usecase where I want to read data from one Oracle table where all fields are varchar type and save it to another Oracle table with similar fields but with ideally correct datatype. This has to be done only in java. So I want to read Dataset from below table:

create table employeeStr (
 name varchar2(50),
 empid varchar2(50),
 age varchar2(50),
 salary varchar2(50),
 dt_joined varchar2(50));

and write to below table:

create table employeeNorm (
 name varchar2(50),
 empid number,
 age number(3,0),
 salary number(10,2),
 dt_joined date);

My java code is as below:

SparkSession sparkSession =
        SparkSession.builder().master("local[*]").appName("HandlingOracleDataTypes").getOrCreate();

SQLContext sqlContext = sparkSession.sqlContext();

sqlContext.udf().register("toDate", new UDF1<String, java.sql.Date>() {
    @Override
    public java.sql.Date call(String dateStr) throws Exception {
        Date date = new SimpleDateFormat("yyyyMMdd").parse(dateStr);
        return new java.sql.Date(date.getTime());
    }
}, DataTypes.DateType);

sqlContext.udf().register("toDate2", new UDF1<String, Date>() {
    @Override
    public Date call(String dateStr) throws Exception {
        Date date = new SimpleDateFormat("yyyyMMdd").parse(dateStr);
        return date;
    }
}, DataTypes.DateType);

sqlContext.udf().register("toDate3", new UDF1<String, String>() {
    @Override
    public String call(String dateStr) throws Exception {
        Date date = new SimpleDateFormat("yyyyMMdd").parse(dateStr);
        return new SimpleDateFormat("dd-MMM-yyyy").format(date);
    }
}, DataTypes.StringType);

Properties connectionProperties = new Properties();
connectionProperties.put("user", "<username>");
connectionProperties.put("password", "<password>");

String jdbcUrl = "<jdbcurl>";

Dataset<Row> employeeStrDS = sparkSession.read().jdbc(jdbcUrl, "employeeStr", connectionProperties);

employeeStrDS.show();
employeeStrDS.printSchema();

employeeStrDS.withColumn("empid",employeeStrDS.col("empid").cast(DataTypes.IntegerType));
employeeStrDS.withColumn("age",employeeStrDS.col("age").cast(DataTypes.IntegerType));
employeeStrDS.withColumn("salary",employeeStrDS.col("salary").cast(DataTypes.FloatType));
//employeeStrDS.withColumn("dt_joined",employeeStrDS.col("dt_joined").cast(DataTypes.DateType));
//employeeStrDS.registerTempTable("abc");
//sqlContext.sql("select toDate(dt_joined) from abc").show();

employeeStrDS.withColumn("dt_joined", functions.callUDF("toDate3", employeeStrDS.col("dt_joined")));
//employeeStrDS.printSchema();
employeeStrDS.write().mode(SaveMode.Append).jdbc(jdbcUrl, "employeeNorm", connectionProperties);

If I remove the "dt_joined" column from table and code, this code work but when I bring "dt_joined" column into picture nothing works. Tried all the 3 UDFs mentioned in the code but everytime get below exception. Please suggest a solution to this.

Caused by: java.sql.BatchUpdateException: ORA-01861: literal does not match format string

    at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12296)
    at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:246)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLDataException: ORA-01861: literal does not match format string

    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450)
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
    at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1059)
    at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:522)
    at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
    at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:587)
    at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:225)

Update: The actual scenario is the spark code reads data from Impala, creates dataframes. Impala table has all columns as String. So basically the issue is saving dataframe with schema as all strings into an Oracle Table having ideal datatypes.

abhihello123
  • 1,668
  • 1
  • 22
  • 38
  • I'd suggest using `spark-shell` and write a code to save just a single row to the Oracle table. With that you know whether the format is correct or not (at the very least). – Jacek Laskowski Oct 23 '17 at 04:54

1 Answers1

2

I don't think you can benefit from using Spark in this case as you would need first to fetch all the data from Oracle DB to your Spark Cluster and then back to Oracle DB. Using SQL you can do everything in place (inside Oracle DB). All you need to do is to execute the following SQL statements (on the Oracle DB side):

insert into employeeNorm
select name, empid, age, salary, to_date(dt_joined, 'yyyy-mm-dd')
from employeeStr;

commit;

you should replace 'yyyy-mm-dd' with the corresponding datetime format - see below for details...

NOTE: date/time format in Oracle's to_date() function is not compatible with standard UNIX format.

Here is a minimal mapping:

Oracle     UNIX
------     ----
YYYY       %Y
YY         %y
MM         %m
DD         %d
HH24       %H
MI         %M
SS         %S

It's up to you how and where to execute those statements - in Java, using sqlplus, etc.

PS if employeeStr is too big to do everything in one transaction, you should consider using BULK INSERT in chunks.

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419