
I have a Parquet file stored in a partitioned directory. The partition layout is:

/dates=*/hour=*/something.parquet.

The content of the Parquet file looks as follows:

{a:1,b:2,c:3}.

This is JSON data, and I want to create an external Hive table.

My approach:

CREATE EXTERNAL TABLE test_table (a int, b int, c int) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';

After that, I run MSCK REPAIR TABLE test_table; but I get the following output:

hive> select * from test_table;
OK
NULL    NULL    NULL    2021-09-27      09

The first three columns (a, b, c) are all NULL. I think I have to define the JSON schema somehow, but I have no idea how to proceed further.

Khan Saab
  • If the data is in Parquet, then the schema is already defined in the Parquet file, and the table DDL should match the Parquet schema. If you have a text file with strings like {a:1,b:2,c:3}, the JSON is not valid and you cannot use the JSON SerDe. In that case, create a table with a single column and parse the strings in SQL. Are you sure the format is Parquet? – leftjoin Oct 14 '21 at 08:12
  • @leftjoin Yes, a JSON string stored in a .parquet file. There are only three columns: dates, hour, and value. Partitioning is done on dates and hour, so we are only left with the value field. – Khan Saab Oct 14 '21 at 08:34
  • @leftjoin The JSON is valid. Maybe I wrote it here incorrectly. – Khan Saab Oct 14 '21 at 08:54
  • Keys in JSON should be quoted. If it is Parquet, then create the table with the same schema: dates, hour (partitions) and value (a single column), and parse value in the query using get_json_object or lateral view with json_tuple; a sketch follows below. – leftjoin Oct 14 '21 at 09:04
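
For reference, a minimal sketch of the get_json_object alternative mentioned in the last comment, assuming the table is created with a single value column as suggested (get_json_object extracts one field per call, so json_tuple is usually cheaper when reading several fields):

select get_json_object(t.value, '$.a') as a,
       get_json_object(t.value, '$.b') as b,
       get_json_object(t.value, '$.c') as c
  from test_table t;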

2 Answers


Create the table with the same schema as the Parquet file:

CREATE EXTERNAL TABLE test_table (value string) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';

Run MSCK REPAIR TABLE to mount the partitions:

MSCK REPAIR TABLE test_table;

Parse value in query:

select e.a, e.b, e.c
  from test_table t
       lateral view json_tuple(t.value, 'a', 'b', 'c') e as a, b, c;

Cast the values to int if necessary: cast(e.a as int) as a
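
Putting it together, a version of the query with the casts applied (a sketch using the schema from this answer):

select cast(e.a as int) as a,
       cast(e.b as int) as b,
       cast(e.c as int) as c
  from test_table t
       lateral view json_tuple(t.value, 'a', 'b', 'c') e as a, b, c;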

leftjoin
  • There are many JSON fields, and parsing them in the query each time is not feasible. Is it possible to build a table on top of the first table? Or is it computationally very expensive? – Khan Saab Oct 14 '21 at 09:18
  • Not possible. If it is already in Parquet, you should use a table with the same schema to be able to read the Parquet files. Parquet mandates using the Parquet SerDe; it is responsible for reading the Parquet format, and it does not parse JSON columns. If you have only a single JSON value, it is better to store it as a text file, not Parquet, one JSON object per row. In that case you can create the table using the JsonSerDe and define the columns to be extracted, and the SerDe will extract all columns (a sketch of that setup follows this thread). If it is Parquet, the only option is to parse in the query using json_tuple or get_json_object. json_tuple extracts all columns in a single call per row. – leftjoin Oct 14 '21 at 09:29
  • Thanks, I got the point. I will save it as a text file and specify how the fields are terminated. Hopefully, that should work. – Khan Saab Oct 14 '21 at 10:09
  • @KhanSaab You can use JSONSerDe. See: https://stackoverflow.com/a/64467399/2700344 – leftjoin Oct 14 '21 at 17:18
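
For completeness, a minimal sketch of the text-file alternative described in this thread, assuming the data is re-saved as one JSON object per line with quoted keys; the table name and location below are hypothetical:

CREATE EXTERNAL TABLE test_table_json (a int, b int, c int)
PARTITIONED BY (dates string, hour string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/output_json/'; -- hypothetical path for the text copy of the data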

You can also create a table that maps the JSON fields to columns directly, like this:

CREATE EXTERNAL TABLE IF NOT EXISTS test_table (
    a INT,
    b INT,
    c INT)
PARTITIONED BY (dates STRING, hour STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET
LOCATION '/user/output/';

Then run MSCK REPAIR TABLE test_table;

You would then be able to query the fields directly without writing any parsing logic.
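
For example, assuming this DDL reads your files correctly, a plain select would return the extracted fields:

select a, b, c, dates, hour
  from test_table
 where dates = '2021-09-27';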

Manasvi Batra