
I have a bunch of JSON Snappy-compressed files in HDFS. They are Hadoop Snappy-compressed (not Python; cf. other SO questions) and have nested structures.

I could not find a method to load them into Hive (using json_tuple).

Can I get some resources/hints on how to load them?

Previous references (which do not have valid answers):

pyspark how to load compressed snappy file

Hive: parsing JSON

tensor

2 Answers

  1. Put all the files in an HDFS folder and create an external table on top of it. If the file names end in .snappy, Hive will recognize and decompress them automatically. You can specify SNAPPY output format for writing the table:

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
 


CREATE EXTERNAL TABLE mydirectory_tbl(
  id   string,
  name string
)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' --this is HDFS/S3 location
;
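
Once the table exists, reading it requires no extra settings, since Hive recognizes the .snappy extension and decompresses transparently. A quick sanity check against the table above:

SELECT id, name FROM mydirectory_tbl LIMIT 10;
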
  2. The JSONSerDe can parse all complex structures, and it is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as-is; everything in square brackets [] becomes an array<>, everything in {} becomes a struct<> or map<>, and complex types can be nested. Carefully read the Readme: https://github.com/rcongiu/Hive-JSON-Serde. There is a section about nested structures and many examples of CREATE TABLE. A minimal sketch of the mapping is shown below.
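
For example, assuming a hypothetical one-line record such as {"id":"1","name":"foo","tags":["a","b"],"address":{"city":"Paris","zip":"75001"}} (table and field names are made up):

CREATE EXTERNAL TABLE mydirectory_nested_tbl(
  id      string,
  name    string,
  tags    array<string>,                   -- ["a","b"] maps to array<>
  address struct<city:string, zip:string>  -- {...} maps to struct<>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir';

-- nested fields are addressed with dot and index notation
SELECT id, tags[0], address.city FROM mydirectory_nested_tbl;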

  3. If you still want to use json_tuple, create a table with a single STRING column and then parse it with json_tuple. But it is much more difficult (see the sketch below).
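
A rough sketch of that approach (table and column names are hypothetical). Note that json_tuple only extracts top-level keys, so nested fields would need a further json_tuple or get_json_object pass, which is where it gets cumbersome:

CREATE EXTERNAL TABLE mydirectory_raw(json string)
LOCATION '/mydir';

SELECT t.id, t.name
FROM mydirectory_raw r
LATERAL VIEW json_tuple(r.json, 'id', 'name') t AS id, name;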

  4. All JSON records should be on a single line (no newlines inside JSON objects, nor \r). The same is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde. See the example after this list.
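
For example, a file like this is readable (one record per line):

{"id":"1","name":"foo"}
{"id":"2","name":"bar"}

while a pretty-printed record spanning several lines is not:

{
  "id": "3",
  "name": "baz"
}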

leftjoin
  • Is the part that enables JSON Snappy reading the TableProperties? – tensor Oct 23 '20 at 13:51
  • Is the JSON SerDe required to read Snappy (yes or no)? I cannot install an external JAR in my env. – quantCode Oct 25 '20 at 03:55
  • @quantCode Not required if the files are CSV, TSV, or Ctrl-A separated. The JSONSerDe is required only to read JSON, whether or not it is compressed with SNAPPY. If you need to read JSON and cannot add a JAR, you can try the native Hive JSONSerDe: 'org.apache.hive.hcatalog.data.JsonSerDe'. It should already be installed. Read more details here: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html – leftjoin Oct 25 '20 at 07:47
  • @quantCode SNAPPY is a compression codec. Hive can natively compress and decompress SNAPPY. Compression codecs and JSON have nothing in common; CSV files can also be compressed using SNAPPY. – leftjoin Oct 25 '20 at 07:57
  • OK, thanks. My main issue was the Snappy part of the compressed JSON. – tensor Oct 25 '20 at 14:55

If your data is partitioned (e.g., by date):

Create the table in Hive:

CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
  filename STRING,
  cnt BIGINT,
  size DOUBLE
) PARTITIONED BY (`date` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs';

Recover the partitions (before the recovery, the table appears to be empty):

MSCK REPAIR TABLE database.table;
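
MSCK REPAIR only discovers partitions whose folders follow the key=value naming convention, so the files are expected to be laid out like this (the paths here are hypothetical):

folder/path/in/hdfs/date=2020-10-23/part-00000.snappy.parquet
folder/path/in/hdfs/date=2020-10-24/part-00000.snappy.parquet

After the repair, a single partition can be queried:

SELECT filename, cnt, size FROM database.table WHERE `date` = '2020-10-23';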

François B.