I have an AWS IoT rule that sends incoming JSON to a Kinesis Firehose.
The JSON data from my IoT publishes is all on one line, for example:
{"count":4950, "dateTime8601": "2017-03-09T17:15:28.314Z"}
The IoT "Test" section in the admin UI allows you to Publish a message, defaults to following (note formatted multi-line JSON):
{
"message": "Hello from AWS IoT console"
}
I am streaming the Firehose to S3, and EMR then converts the data to a columnar format so it can ultimately be queried by Athena.
The problem is that during the conversion to columnar format, Hive (specifically the JSON SerDe) cannot handle a JSON object that spans more than one line. One such record blows up the conversion, and the good, single-line JSON records do not get converted either.
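To make that concrete, an S3 object delivered by Firehose ends up looking something like this (using the two payloads above) - the SerDe reads the file line by line, so the pretty-printed record from the Test console breaks it:
{"count":4950, "dateTime8601": "2017-03-09T17:15:28.314Z"}
{
  "message": "Hello from AWS IoT console"
}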
My questions are:
- How do you set up Firehose to ignore multi-line JSON?
- If that is not possible, how do you tell Hive to strip newlines before loading into the table, or at least to catch the exception and keep going?
I am already trying to ignore malformed JSON when defining the Hive table:
DROP TABLE site_sensor_data_raw;
CREATE EXTERNAL TABLE site_sensor_data_raw (
  count int,
  dateTime8601 timestamp
)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',
  "timestamp.formats" = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z',millis"
)
LOCATION 's3://...';
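One thing I am unsure of: as far as I can tell, 'ignore.malformed.json' is documented for the OpenX JSON SerDe rather than the hcatalog one I am using, so a variant like this (assuming the OpenX SerDe jar is available on the EMR cluster) may be what is actually needed:
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true'
)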
Here is the full HQL that does the conversion:
-- Example of converting to ORC/columnar format
DROP TABLE site_sensor_data_raw;
CREATE EXTERNAL TABLE site_sensor_data_raw (
  count int,
  dateTime8601 timestamp
)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',
  "timestamp.formats" = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z',millis"
)
LOCATION 's3://bucket.me.com/raw/all-sites/';
ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='15') LOCATION 's3://bucket.me.com/raw/all-sites/2017/03/09/15';
ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='16') LOCATION 's3://bucket.me.com/raw/all-sites/2017/03/09/16';
ALTER TABLE site_sensor_data_raw ADD PARTITION (year='2017',month='03',day='09',hour='17') LOCATION 's3://bucket.me.com/raw/all-sites/2017/03/09/17';
DROP TABLE to_orc;
CREATE EXTERNAL TABLE to_orc (
  count int,
  dateTime8601 timestamp
)
STORED AS ORC
LOCATION 's3://bucket.me.com/orc'
TBLPROPERTIES ("orc.compress"="ZLIB");
INSERT OVERWRITE TABLE to_orc SELECT count, dateTime8601 FROM site_sensor_data_raw WHERE year=2017 AND month=03 AND day=09 AND hour=15;
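For completeness, the ORC output is then exposed to Athena with an equivalent table definition - a sketch, assuming the same ORC location as above (the table name is just illustrative):
CREATE EXTERNAL TABLE site_sensor_data (
  count int,
  dateTime8601 timestamp
)
STORED AS ORC
LOCATION 's3://bucket.me.com/orc';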