I have an AWS IoT rule that sends incoming JSON to a Kinesis Firehose.

The JSON data from my IoT publish is all on one line - ex:

{"count":4950, "dateTime8601": "2017-03-09T17:15:28.314Z"}

The IoT "Test" section in the admin UI allows you to publish a message, and it defaults to the following (note the formatted, multi-line JSON):

{
  "message": "Hello from AWS IoT console"
}

I am streaming the Firehose to S3, which is then converted by EMR to a columnar format to ultimately be used by Athena.

The problem is that during conversion to a columnar format, Hive (specifically the JSON SerDe) cannot handle a JSON object that spans more than one line. It blows up the conversion and fails to convert even the good, single-line JSON records.

My question is:

  • How do you set up Firehose to ignore multi-line JSON?
  • If that's not possible, how do you tell Hive to remove newlines before loading into the table, or at least catch exceptions and continue?

I am already trying to ignore malformed JSON when defining the Hive table:

DROP TABLE site_sensor_data_raw;
CREATE EXTERNAL TABLE site_sensor_data_raw (
  count int,
  dateTime8601 timestamp
)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',
  'timestamp.formats' = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z',millis"
)
LOCATION 's3://...';

Here is my full HQL that does conversion:

--Example of converting to ORC/columnar formats
DROP TABLE site_sensor_data_raw;
CREATE EXTERNAL TABLE site_sensor_data_raw (
  count int,
  dateTime8601 timestamp
)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',
  'timestamp.formats' = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z',millis"
)
LOCATION 's3://bucket.me.com/raw/all-sites/';

ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='15') location 's3://bucket.me.com/raw/all-sites/2017/03/09/15';
ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='16') location 's3://bucket.me.com/raw/all-sites/2017/03/09/16';
ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='17') location 's3://bucket.me.com/raw/all-sites/2017/03/09/17';

DROP TABLE to_orc;
CREATE EXTERNAL TABLE to_orc (
      count int,
      dateTime8601 timestamp
)
STORED AS ORC
LOCATION 's3://bucket.me.com/orc'
TBLPROPERTIES ("orc.compress"="ZLIB");

INSERT OVERWRITE TABLE to_orc SELECT count,dateTime8601 FROM site_sensor_data_raw where year=2017 AND month=03 AND day=09 AND hour=15;
John Rotenstein
rynop

1 Answer
Well, the default JSON SerDes used on EMR and Athena cannot handle multi-line JSON records. Each JSON record must be on a single line.

With multi-line JSON, I see two problems from the Hive/Hadoop and even Presto (used in Athena) perspective:

  • Given a file, Hive/Hadoop and the JSON SerDes cannot recognize where one JSON record ends and the next begins in order to return its object representation.
  • Given multiple files, multi-line JSON files are not splittable the way normal newline-delimited JSON files are.
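The first problem can be reproduced outside Hive. The sketch below (an illustration, not the SerDe's actual code) parses each line as a standalone record, the way a line-oriented SerDe does: the single-line record from the question parses fine, while every fragment of the pretty-printed record fails.

```python
import json

# one single-line record, followed by one pretty-printed record split across lines
lines = [
    '{"count":4950, "dateTime8601": "2017-03-09T17:15:28.314Z"}',
    '{',
    '  "message": "Hello from AWS IoT console"',
    '}',
]

def parse_lines(lines):
    """Treat each line as one JSON record, as a line-based SerDe would."""
    good, bad = [], []
    for line in lines:
        try:
            good.append(json.loads(line))
        except ValueError:
            bad.append(line)
    return good, bad

good, bad = parse_lines(lines)
# only the single-line record parses; all three fragments of the
# multi-line record are malformed on their own
```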

To get around this on the EMR/Athena end, you would need to write your own custom SerDe based on your data structure, catch exceptions, and so on.

How do you set up Firehose to ignore multi-line JSON?

Firehose doesn't have the ability to ignore a specific format. It takes whatever is put in through its API (PutRecord or PutRecordBatch) as a data blob and sends it to the destination.

http://docs.aws.amazon.com/firehose/latest/APIReference/API_PutRecordBatch.html

That said, Firehose offers data transformation with AWS Lambda, where you can use a Lambda function to transform incoming records on the stream and deliver the transformed data to the destination. You could use that feature to recognize and flatten multi-line JSON before delivery, and also drop records that are not properly formatted. You will need to check how IoT sends the multi-line JSON to Firehose (e.g., line by line) to write your own function.

https://aws.amazon.com/blogs/compute/amazon-kinesis-firehose-data-transformation-with-aws-lambda/
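A minimal sketch of such a transformation Lambda is below. The event shape and the recordId/result/data response fields follow the Firehose transformation contract; the flattening logic itself assumes each Firehose record carries exactly one (possibly pretty-printed) JSON object, which you would need to verify against your IoT setup.

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda (sketch): re-serialize each record's
    JSON onto a single line; mark unparseable records as failed."""
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data']).decode('utf-8')
        try:
            # json.loads accepts pretty-printed JSON as long as the whole
            # record is one object
            obj = json.loads(payload)
            # compact, single-line re-serialization; trailing newline keeps
            # the S3 output newline-delimited for Hive/Athena
            flat = json.dumps(obj, separators=(',', ':')) + '\n'
            output.append({
                'recordId': record['recordId'],
                'result': 'Ok',
                'data': base64.b64encode(flat.encode('utf-8')).decode('utf-8'),
            })
        except ValueError:
            output.append({
                'recordId': record['recordId'],
                'result': 'ProcessingFailed',
                'data': record['data'],
            })
    return {'records': output}
```

Records marked ProcessingFailed are delivered by Firehose to its error output rather than mixed into your good data.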

If that's not possible, how do you tell Hive to remove newlines before loading into the table, or at least catch exceptions and continue?

If you still have multi-line JSON at your Firehose destination: since you already have EMR in your ETL, you can use its compute instead of Lambda to flatten the JSON. Spark can help here too; it gained support for reading multi-line JSON files (not just newline-delimited JSON). https://issues.apache.org/jira/browse/SPARK-18352

You can then convert that data into a columnar format for Athena to query.

jc mannem