
I am trying to create an external Spectrum table on top of plain text files, but some values come back as null because they contain special characters.

Create statement:

create external table s.table_1
(
    id bigint,
    city varchar(100)
)
partitioned by (file_date varchar(25))
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    with serdeproperties ('quoteChar'='\"', 'separatorChar'=';', 'serialization.format'='1') stored as
    inputformat 'org.apache.hadoop.mapred.TextInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://my_bucket/my_prefix/'
table properties ('skip.header.line.count'='1');
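(For context: Spectrum's OpenCSVSerde reads input as UTF-8, so a file encoded as Latin-1 / ISO 8859-1 would produce exactly this symptom for non-ASCII values. A quick local sketch can show why, using a hard-coded byte string as a stand-in for a line of the actual file:)

```python
# Sketch: why a Latin-1 encoded value fails UTF-8 decoding.
# The byte string below is a hypothetical stand-in for one line of the file;
# 0xF6 is 'ö' in Latin-1 / ISO 8859-1, but is not valid UTF-8 here.
raw = b"1;M\xf6nchengladbach"

try:
    raw.decode("utf-8")
    print("decodes as UTF-8")
except UnicodeDecodeError:
    # The same bytes decode fine as Latin-1.
    print("not valid UTF-8; as Latin-1:", raw.decode("latin-1"))
```

If your file fails the UTF-8 decode like this, the nulls are an encoding issue rather than a problem with the DDL.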

But when I select from the table, the results are:

id,city
1,<null>
2,<null>
3,Erfurt
4,Erfurt
5,Lahr

Instead of the expected:

id,city
1,Mönchengladbach
2,Mönchengladbach
3,Erfurt
4,Erfurt
5,Lahr

I tried to:

  1. set serdeproperties 'serialization.format' = '1', 'UTF-8', 'Latin-1', or 'ISO 8859-1'
  2. set row format delimited fields terminated by ';' and change the serialization.format in the table properties

but none of the above worked. I am just copying the file as-is, and it would suit me best to load it without having to transform it in advance. Any suggestions? Thank you in advance.
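(If re-encoding before upload turns out to be unavoidable, it can be a one-liner rather than a real transformation step. A minimal sketch, assuming the file is Latin-1 encoded; the file names here are hypothetical stand-ins:)

```python
# Sketch: re-encode a Latin-1 CSV as UTF-8 before uploading to S3.
# "cities_latin1.csv" / "cities_utf8.csv" are hypothetical example paths.

# Simulate the original file: 'ö' stored as the single Latin-1 byte 0xF6.
with open("cities_latin1.csv", "wb") as f:
    f.write(b"id;city\n1;M\xf6nchengladbach\n")

# Decode as Latin-1, re-encode as UTF-8; content is otherwise unchanged.
with open("cities_latin1.csv", encoding="latin-1") as src, \
     open("cities_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())

print(open("cities_utf8.csv", encoding="utf-8").read())
```

The same conversion can also be done with `iconv -f ISO-8859-1 -t UTF-8` on the command line.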

  • I don't have a cluster to hand, but isn't text encoding set in the cluster properties? Maybe that influences Spectrum. In any event, you've discovered one of the key reasons for not using Spectrum: silent errors. – Paul R, Jul 01 '21 at 13:27
