How to read quoted CSV with NULL values into Amazon Athena

Question

I'm trying to create an external table in Athena using quoted CSV file stored on S3. The problem is, that my CSV contain missing values in columns that should be read as INTs. Simple example:

CSV:

id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"

CREATE TABLE DEFINITION:

CREATE EXTERNAL TABLE schema.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT 
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'

CREATE TABLE statement runs fine but as soon as I try to query the table, I'm getting HIVE_BAD_DATA: Error parsing field value ''.

I tried making the CSV look like this (quote empty string):

"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"

But it's not working.

Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.

Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.

It works, when you specify all columns as STRING but that's not what I need.

Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?

Zerodf · Accepted Answer · 2018-06-07T03:42:17.300

5

Quick and dirty way to handle these data:

CSV:

id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"

DDL:

CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;

The issue is that it is not handling the quote characters in the last field. Based on the documentation provided by AWS, this makes sense as the LazySimpleSerDe given the following from Hive.

I suspect the solution is using the following SerDe org.apache.hadoop.hive.serde2.RegexSerDe.

I will work on the regex later.

Edit:

Regex as promised:

CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;

Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.

Further Reading:

Stackoverflow - remove surrounding quotes from fields while loading data into hive

Athena - OpenCSVSerDe for Processing CSV

edited Jun 07 '18 at 03:42

answered Jun 06 '18 at 22:00

Zerodf

2,208
18
26

Does anyone know what version of Hive is being used by Athena? – Zerodf Jun 07 '18 at 03:42
Indeed the RegexSerDe works as expected, thanks! I'm only wondering, since Athena is just an abstraction layer on top of the existing data and I imagine the file is parsed each time it is read, are there any performance differences between using RegexSerDe and OpenCSVSerDe? – Mikolaj Jun 07 '18 at 11:15
That is a really interesting question. I've been meaning to setup some Athena benchmarking tests and to be honest, I've never considered the differences between different Serializer/Deserializers. – Zerodf Jun 07 '18 at 12:00
1

I have not ran any serious benchmarks but from what I've tested the performance of the regex serde is extremely poor comparing to OpenCSV. I have a table that has 5kk rows and 20 columns of various types. If I run a simple count(*) on the table created with regex the query runs for over 20 minutes. For the CSV it's a matter of seconds. But who knows, I might be doing something else wrong. Cheers. – Mikolaj Jun 20 '18 at 14:19
Would this regex method work for any number of columns? How do you interpret regex pattern? – Akhan Apr 07 '20 at 08:03

score 2 · Answer 2 · answered Jul 16 '21 at 13:48

Unfortunately there is no way to get both support for quoted fields and support for null values in Athena. You have to choose either or.

You can use OpenCSVSerDe and type all columns as string, that will give you support for quoted fields, and emtpty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.

Or you can use LazySimpleSerDe and strip quotes at query time.

I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.

You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV

score 0 · Answer 3 · answered Nov 24 '21 at 21:42

0

This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/

answered Nov 24 '21 at 21:42

AlizaminJ

95
1
5

How to read quoted CSV with NULL values into Amazon Athena

3 Answers3