
I have an external table with the DDL below:

CREATE EXTERNAL TABLE `table_1`(
  `name` string COMMENT 'from deserializer', 
  `desc1` string COMMENT 'from deserializer', 
  `desc2` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'quoteChar'='\"', 
  'separatorChar'='|', 
  'skip.header.line.count'='1') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://temp_loc/temp_csv/'
TBLPROPERTIES (
  'classification'='csv', 
  'compressionType'='none', 
  'typeOfData'='file')

The CSV files that this table reads are UTF-16 LE encoded. When I try to render the output in Athena, the special characters are displayed as question marks. Is there any way to set the encoding in Athena, or otherwise fix this?

Infinite
  • Did you look into [lazySimpleSerde](https://docs.aws.amazon.com/athena/latest/ug/lazy-simple-serde.html)? It supports `serialization.encoding` which may help, also see this [post](https://stackoverflow.com/questions/36283001/hive-utf-8-encoding-number-of-characters-supported) – Philipp Johannis Oct 02 '20 at 19:25

1 Answer


The solution, as Philipp Johannis mentions in a comment, is to use LazySimpleSerDe and set the `serialization.encoding` table property to `UTF-16LE`. As far as I can tell, LazySimpleSerDe resolves this property with `java.nio.charset.Charset.forName`, so any encoding/charset name accepted by Java should work.
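A minimal sketch of what the revised DDL could look like, reusing the table columns and S3 location from the question. Note the assumptions: LazySimpleSerDe takes `field.delim` rather than OpenCSVSerde's `separatorChar`, it has no `quoteChar` support, and `skip.header.line.count` is moved to `TBLPROPERTIES`:

```sql
-- Sketch: same table as in the question, but with LazySimpleSerDe so that
-- 'serialization.encoding' takes effect (OpenCSVSerde does not support it).
CREATE EXTERNAL TABLE `table_1`(
  `name` string,
  `desc1` string,
  `desc2` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='|',
  -- any charset name accepted by java.nio.charset.Charset.forName
  'serialization.encoding'='UTF-16LE')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://temp_loc/temp_csv/'
TBLPROPERTIES (
  'skip.header.line.count'='1',
  'classification'='csv')
```

Since LazySimpleSerDe cannot strip quote characters, this assumes the fields in the files are not quoted; quoted fields would keep their quotes in the query output.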

Theo