
I'm trying to load data from Oracle into Databricks, but I've run into a Unicode issue in PySpark: it can't render the Unicode characters in the form they are stored in Oracle and instead displays the replacement character '▯'. In Oracle, NLS_NCHAR_CHARACTERSET=AL16UTF16.

I tried the Oracle JDBC system property from Inserting national characters into an oracle NCHAR or NVARCHAR column does not work, but it doesn't work in my case. Could you please suggest an alternative to fix this issue?
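
For reference, the fix discussed in that linked question is the oracle.jdbc.defaultNChar property (assuming that is the property meant here). A minimal sketch of passing it through Spark's JDBC reader, with placeholder URL, table, and credentials:

# Spark forwards options it does not recognize to the JDBC driver as
# connection properties, so defaultNChar can be set on the reader itself.
# URL, table, and credentials below are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "MYSCHEMA.MYTABLE")
      .option("user", "scott")
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("oracle.jdbc.defaultNChar", "true")
      .load())

# It can also be set as a JVM system property on the driver and executors:
#   --conf spark.driver.extraJavaOptions=-Doracle.jdbc.defaultNChar=true
#   --conf spark.executor.extraJavaOptions=-Doracle.jdbc.defaultNChar=true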


1 Answer


Option 1 - pass JDBC options when you read the data with Spark:

spark.read.format("jdbc")...option("useUnicode", "true").option("characterEncoding", "UTF-16")

Option 2 - put the properties in the connection string:

url = "...?useUnicode=true&characterEncoding=UTF-16"
spark.read.format("jdbc").option("url", url)
YuriR
  • Hi @YuriR, it works for SQL but not for Oracle; I tried both options. Can you suggest an alternative, please? – Ahnvi Jan 10 '22 at 10:31
  • @Ahnvi is your table defined with UTF-16 support? Try querying your DB with pure Python or Java and see if you get the correct characters (see the sketch below). – YuriR Jan 10 '22 at 10:48
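
A minimal sanity check along the lines of that last comment, using the python-oracledb driver (the package choice, connection details, and table/column names are placeholders, not from the thread):

# pip install oracledb
import oracledb

# Placeholder credentials and DSN.
conn = oracledb.connect(user="scott", password="tiger", dsn="dbhost:1521/ORCL")
cur = conn.cursor()

# Placeholder table/column; NVARCHAR2 values should arrive as Python str.
cur.execute("SELECT nchar_col FROM myschema.mytable FETCH FIRST 5 ROWS ONLY")
for (value,) in cur:
    # repr() shows the actual code points, so a U+FFFD replacement character
    # is easy to spot if the data is already mangled in transit.
    print(repr(value))

cur.close()
conn.close()

If pure Python returns the correct characters, the data and the driver are fine and the problem is on the Spark side; if not, the issue is in the table definition or the session character-set handling.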