1

Suppose I have an external table like this:

CREATE EXTERNAL TABLE my.data (
  `id` string,
  `timestamp` string,
  `profile` struct<
    `name`: string,
    `score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');

A few of my documents have an invalid profile.score (a string rather than an integer).

This causes queries in Athena to fail:

"Status": { "State": "FAILED", "StateChangeReason": "HIVE_BAD_DATA: Error parsing field value for field 0: For input string: \"4099999.9999999995\"",

How can I configure Athena to skip the documents that do not fit the external table schema?


The question here is about finding the problematic documents; this question is about skipping them.

sdgfsdh
  • 33,689
  • 26
  • 132
  • 245

2 Answers2

1

Here is a sample on how to exclude a particular file

SELECT
   * 
FROM 
    "some_database"."some_table"
WHERE(
  "$PATH" != 's3://path/to/a/file'
)

Just tested this approach with

SELECT 
   COUNT(*)
FROM 
    "some_database"."some_table"
-- Result: 68491573

SELECT 
   COUNT(*)
FROM 
    "some_database"."some_table"
WHERE(
  "$PATH" != 's3://path/to/a/file'
)
-- Result: 68041452

SELECT 
   COUNT(*)
FROM 
    "some_database"."some_table"
WHERE(
  "$PATH" = 's3://path/to/a/file'
)
-- Result: 450121

Total: 450121 + 68041452 = 68491573

Ilya Kisil
  • 2,490
  • 2
  • 17
  • 31
  • I do not know which files are broken. – sdgfsdh Nov 19 '19 at 15:10
  • In that case, the above approach is pending resolution to your post on [how to identify broken files](https://stackoverflow.com/questions/58919242/how-do-i-identify-problematic-documents-in-s3-when-querying-data-in-athena/58936905#58936905) – Ilya Kisil Nov 19 '19 at 15:38
  • I do not want to have to build a process where I first scan for broken files and then pass those to the query. – sdgfsdh Nov 19 '19 at 15:42
  • Provided that you come up with an query on how to identify those file, you will be able to do that in a single run using for example `WITH` statement. Unless your concern is cost of a query and execution time. – Ilya Kisil Nov 19 '19 at 15:52
0

I have faced same issue. Since I could not found a specific solution, I have used a different approach. It might help you. The error is related to bad data in profile field. Since you are using “struct” for profile field, Athena is expecting the profile field’s data in structured fashion in source files. If there is any bad data in this field, you will experience this error.

Can you try below queries:

CREATE EXTERNAL TABLE my.data (
id string,
timestamp string,
profile string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');

and use below query to get expected result

select 
 id
 ,timestamp
 ,socialdata
 ,json_extract_scalar(profile, '$.name')profile_name
 ,json_extract_scalar(profile, '$.score')profile_score
 from my.data;

You can visit this link for more.

Aftab Ansari
  • 926
  • 9
  • 17