How do I identify problematic documents in S3 when querying data in Athena?

Question

I have a basic Athena query like this:

SELECT *
FROM my.dataset LIMIT 10

When I try to run it I get an error message like this:

Your query has the following error(s):

HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"

How do I identify the S3 document that has the invalid field?

My documents are JSON.

My table looks like this:

CREATE EXTERNAL TABLE my.data (
  `id` string,
  `timestamp` string,
  `profile` struct<
    `name`: string,
    `score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');

Athena is based on Presto. You can use Presto directly (eg https://www.starburstdata.com/presto-aws-cloud/ or on Presto on EMR) and it will report problematic file's path in the error response. (You may need to use `--debug` option in Presto CLI to get full error response, or get it from Presto Web UI.). I don't know how to get the same information from Athena directly. — Piotr Findeisen, Nov 18 '19 at 22:01
What is the underlaying file format for your data? Is it JSON/CSV? What is your DDL looks like? — shuvalov, Nov 19 '19 at 09:42
Have you come across [this post](https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/)? It suggests to use `SELECT "$PATH", * FROM your_table WHERE your_column is NULL` to identify the files with malformed records — Ilya Kisil, Nov 19 '19 at 14:46
@IlyaKisil I couldn't find that bit. I also don't know which column is the issue. — sdgfsdh, Nov 19 '19 at 14:51
From [another your post](https://stackoverflow.com/questions/58936088/how-to-skip-documents-that-do-not-match-schema-in-athena), I assumed that you know which column is malformed, i.e. `profile.score` (a string rather than an integer). — Ilya Kisil, Nov 19 '19 at 15:00

Ilya Kisil · Answer 1 · 2019-11-19T22:11:19.763

Inconsistent schema

Inconsistent schema is when values in some rows are of different data type. Let's assume that we have two json files

// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos",  "age":"eleven"}
{"name":"1Fabiana", "age":22}

// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos",  "age":11}
{"name":"2Fabiana", "age":22}

Then a simple query SELECT * FROM some_table will fail with

HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"

However, we can exclude that file within WHERE clause

SELECT 
    "$PATH" AS "source_s3_file", 
    * 
FROM some_table 
WHERE "$PATH" != 's3://path/to/bad.json'

Result:

        source_s3_file | name     | age
---------------------------------------
s3://path/to/good.json | 1Patrick | 35
s3://path/to/good.json | 1Carlos  | 11
s3://path/to/good.json | 1Fabiana | 22

Of course, this is the best case scenario when we know which files are bad. However, you can employ this approach to somewhat manually infer which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.

SELECT 
    COUNT(*)
FROM some_table 
WHERE regexp_like("$PATH",  's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, that those files are good.

The obvious drawback of such approach is cost to execute query and time spent, especially if it is done file by file.

Malformed records

In the eyes of AWS Athena, good records are those which are formatted as a single JSON per line:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }

AWS Athena supports OpenX JSON SerDe library which can be set to evaluate malformed records as NULL by specifying

-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')

when you create table. Thus, the following query will reveal files with malformed records:

SELECT 
    DISTINCT("$PATH")
FROM "some_database"."some_table" 
WHERE(
    col_1 IS NULL AND 
    col_2 IS NULL AND 
    col_3 IS NULL
    -- etc
)

Note: you can use only a single col_1 IS NULL if you are 100% sure that it doesn't contain empty fields other then in corrupted rows.

In general, malformed records are not that big of a deal provided that 'ignore.malformed.json' = 'true'. For example the following query will still succeed For example if a file contains:

{"name": "2Patrick","age": 35,"address": "North Street"}
{
    "name": "2Carlos",
    "age": 11,
    "address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}

the following query will still succeed

SELECT 
    "$PATH" AS "source_s3_file",
    *
FROM some_table

Result:

              source_s3_file |     name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35  | North Street
2 s3://path/to/malformed.json|          |     |
3 s3://path/to/malformed.json|          |     |
4 s3://path/to/malformed.json|          |     |
5 s3://path/to/malformed.json|          |     |
6 s3://path/to/malformed.json|          |     |
7 s3://path/to/malformed.json| 2Fabiana | 22  | Main Street

While with 'ignore.malformed.json' = 'false' (which is the default behaviour) exactly the same query will throw an error

HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

Can you verify that your table use `'org.openx.data.jsonserde.JsonSerDe'` and has SerDe property `'ignore.malformed.json' = 'true'`. Both [AWS docs](https://docs.aws.amazon.com/athena/latest/ug/json.html#openxjson) and [Hive-JSON-Serde](https://github.com/rcongiu/Hive-JSON-Serde#importing-malformed-data) tell that they support it. — Ilya Kisil, Nov 19 '19 at 15:28
On the other hand, maybe this approach only works if the file structure is malformed, e.g. more then one record per line, has commas at the end of the line or json is in unminifyied version. Whereas in your case all your records are prefectly fine from JSON point of view, but they are just inconsistent. — Ilya Kisil, Nov 19 '19 at 15:36
I can confirm those properties are enabled. The JSON is well formed, but there is a string where an int should be. — sdgfsdh, Nov 19 '19 at 15:38
All I can think atm, is manual approach (which could be expensive if you have a lot of data). Try to do `SELECT COUNT(*) FROM your_table WHERE regexp_like("$PATH", 'regex for directories')`. If this doesn't produce error then the files on that regex are ok. — Ilya Kisil, Nov 19 '19 at 15:44
This doesn't answer the question, how to find those malformed events — ArielB, Dec 22 '21 at 11:41

How do I identify problematic documents in S3 when querying data in Athena?

1 Answers1

Inconsistent schema

Malformed records

Linked