Let me try to explain a few problems that I see up front.
- It looks like your desired output expects some data that is part of the file path (device and sensor), but it is not defined as part of your table definition; only columns in the table definition or virtual columns will be available.
- Having many small files could impact the performance of your queries (but this will not affect your desired result).
- Hive partitions are used to improve query performance by avoiding scanning all the data. Partitions point to folders; in this case you are trying to access specific files.
- Your desired output basically explodes 1 record into several records; this shouldn't be handled at the table definition, it can be done through your select statement.
- Hive partitions have the naming convention partitionname=partitionvalue. This is not mandatory, but it is useful if you want to take advantage of commands to automatically add partitions based on your folder structure.
This is how I would solve your problem if you will mainly query by sensor or device:
Change structure of your data
Your folder structure ideally should go from
farm_iot/sensor_data/farm/farm0001/sensor01/1541252701443
to farm_iot/sensor_data/farm/device=farm0001/sensor=sensor01/1541252701443
Change your table definition
Your table definition should contain your partition columns to be able to select them without regex and take advantage of the performance improvement they bring (I am guessing a common query will filter by device or sensor). In addition to that, you need to add all the json columns that are part of your files:
CREATE EXTERNAL TABLE IF NOT EXISTS farm.sensor_data (
temperature double,
pressure double,
humidity double
)
PARTITIONED BY (device string, sensor string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
LOCATION 's3://farm-iot/sensor_data/farm/'
TBLPROPERTIES ('has_encrypted_data'='false')
Querying your data
We are missing the timestamp, which is essentially part of your filename. With the json input, we can include the file name during the select statement using the virtual column INPUT__FILE__NAME, as follows:
select device, sensor, temperature, pressure, humidity, INPUT__FILE__NAME as mytimestamp from farm.sensor_data
If you want pressure, temperature and humidity in different rows, I would recommend creating an array with those three and exploding it; it should be much more efficient than running 3 queries using UNION ALL to append the results.
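As a sketch of that approach (assuming the partitioned table defined above), you can build a map of metric name to value and let LATERAL VIEW explode turn it into one row per metric:

```sql
-- Pairs each metric name with its value in a map, then explodes the
-- map into (metric, value) rows; one input record becomes three rows.
SELECT device, sensor, metric AS data_point, value
FROM farm.sensor_data
LATERAL VIEW explode(map(
  'temperature', temperature,
  'pressure', pressure,
  'humidity', humidity
)) m AS metric, value;
```

This reads each file once instead of three times, which is where the savings over UNION ALL come from.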
Adding new partition
If you follow the Hive convention, you can take advantage of the command msck repair table to automatically add new partitions once new devices/sensors are included. In the worst case, if you want to keep your folder structure, you can add partitions as follows:
ALTER TABLE farm.sensor_data ADD PARTITION (device='farm0001', sensor='sensor01') LOCATION 's3://farm-iot/sensor_data/farm/farm0001/sensor01'
NOTE: new partitions will not be added automatically; you always need to add them yourself.
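For reference, a sketch of the repair command mentioned above (assuming the table name and the Hive-style folder layout from the previous sections):

```sql
-- Scans the table's S3 location and registers any folders that follow
-- the Hive convention (device=.../sensor=...) as new partitions.
MSCK REPAIR TABLE farm.sensor_data;
```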
I tried to add as much detail as possible. If something is not clear let me know.
EDIT:
If your queries will be mostly based on time series (a date range, for example), I would recommend adding a partition at the day level (not smaller than this) to improve the performance of your queries. So your table definition would look like:
CREATE EXTERNAL TABLE IF NOT EXISTS farm.sensor_data (
temperature double,
pressure double,
humidity double
)
PARTITIONED BY (dt string, device string, sensor string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
LOCATION 's3://farm-iot/sensor_data/farm/'
TBLPROPERTIES ('has_encrypted_data'='false')
And your folder structure would look like
farm_iot/sensor_data/farm/dt=20191204/device=farm0001/sensor=sensor01/1541252701443
As clarification: you do not need to modify the table definition for each new partition, you only need to add the partitions to the table; this is essentially how Hive knows that a new partition was created. If you decide to use partitions, this is the only way; if you don't (which will impact performance), there are some other alternatives to make it work.
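As an illustration (the partition values here are hypothetical), adding one partition under the day-level scheme would look like:

```sql
-- Registers one day/device/sensor partition; with the Hive-style
-- folder layout above, the location is derived from the partition values.
ALTER TABLE farm.sensor_data ADD PARTITION (dt='20191204', device='farm0001', sensor='sensor01');
```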
EDIT2:
If you want to keep your data structure as is and not use partitions, it is possible to get the expected results as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS yourdb.sensordata (
temperature double,
pressure double,
humidity double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
LOCATION 's3://farm-iot/sensor_data/farm/'
TBLPROPERTIES ('has_encrypted_data'='false');
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
select * from yourdb.sensordata;
select
split(input__file__name, "/")[size(split(input__file__name, "/")) - 1] as ts,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 3] as device,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 2] as sensor,
'temperature' as data_point,
temperature as value
from yourdb.sensordata
union all
select
split(input__file__name, "/")[size(split(input__file__name, "/")) - 1] as ts,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 3] as device,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 2] as sensor,
'pressure' as data_point,
pressure as value
from yourdb.sensordata
union all
select
split(input__file__name, "/")[size(split(input__file__name, "/")) - 1] as ts,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 3] as device,
split(input__file__name, "/")[size(split(input__file__name, "/")) - 2] as sensor,
'humidity' as data_point,
humidity as value
from yourdb.sensordata;
As you see, I am getting most of the information from the file path; however, it is required to set some flags to tell Hive to read folders recursively. The output looks like this:
ts,device,sensor,data_point,value
1541252701443,farm0001,sensor01,temperature,14.78
1541252701443,farm0001,sensor01,pressure,961.7
1541252701443,farm0001,sensor01,humidity,68.32