
How can I insert overwrite directory with a JSON schema?

There is a raw Hive Avro table (it actually has many more fields than shown):

tb_test--------
name string
nickname string
-----------------

Then I want to save the query result into a directory in HDFS using JsonSerDe.

I tried this:

insert overwrite directory '/json/'
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
 "stat_name"="$._col0",
 "stat_interval"="$._col1"
)
STORED AS TEXTFILE 
select name, nickname
from tb_test limit 100

But the JSON written to /json/ has _colXX field names instead of the original field names.

{"_col0":"basic_qv","_col1":"h"}
{"_col0":"basic_qv","_col1":"h"}
{"_col0":"basic_qv","_col1":"h"}
{"_col0":"basic_qv","_col1":"h"}
{"_col0":"basic_qv","_col1":"h"}

I expected

{"name":"basic_qv","nickname":"h"}
{"name":"basic_qv","nickname":"h"}
{"name":"basic_qv","nickname":"h"}
{"name":"basic_qv","nickname":"h"}
{"name":"basic_qv","nickname":"h"}

What would help with this?

Thanks!!

Jihun No
  • I'm facing a similar issue with AVRO format. Have you found the solution? Please let me know if you know any workarounds. – Munesh Jul 11 '17 at 21:54
  • Is Avro the output format or the table format? I couldn't find any workaround in Hive. Instead I did this with Spark: run Spark SQL, then export the dataframe as JSON. Spark supports various export formats natively. – Jihun No Jul 11 '17 at 23:49
  • AVRO is the required output file format. Yes, Spark is a good workaround. – Munesh Jul 12 '17 at 23:43
  • @JihunNo could you share a sample avro file that you are trying out ? – DataWrangler Oct 25 '17 at 14:06
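The Spark workaround described in the comments above can be sketched roughly like this (PySpark; the session setup, app name, and reuse of the question's /json/ path are assumptions, not from the original posts):

from pyspark.sql import SparkSession

# A Spark session with Hive support, so spark.sql can read the Hive table
spark = (SparkSession.builder
         .appName("export-tb-test-json")   # hypothetical app name
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("select name, nickname from tb_test limit 100")

# The DataFrame writer keeps the real column names,
# so the output lines look like {"name":"...","nickname":"..."}
df.write.mode("overwrite").json("/json/")

This sidesteps the Hive issue entirely because Spark carries the column names through to the JSON writer instead of renaming them to _colN.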

2 Answers


It seems your issue, along with a workaround (using a JSON UDF with named_struct), is described here: https://github.com/rcongiu/Hive-JSON-Serde/issues/151

extract.hql:
add jar /home/myuser/lib/json-udf-1.3.8-SNAPSHOT-jar-with-dependencies.jar;
create temporary function tjson as 'org.openx.data.udf.JsonUDF';

insert overwrite local directory '/json/'
select
tjson(named_struct("name", t.name,"nickname", t.nickname))
from tb_test t
;

Alternatively, you can create a JsonSerDe-based table with the columns defined, insert overwrite into it, and use the table's location instead of a plain directory.
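A minimal sketch of that table-based alternative, assuming the tb_test table and /json/ location from the question (the table name tmp_json is hypothetical):

create table tmp_json (
  name string,
  nickname string
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile
location '/json/';

insert overwrite table tmp_json
select name, nickname
from tb_test
limit 100;

Because the insert targets a table whose columns are named, the SerDe writes {"name":...,"nickname":...} instead of the _col0/_col1 placeholder names.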

leftjoin

I ran into this today. In case you can't or don't want to use UDFs, following this guideline: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html, this worked for me (note that the column order matters, and the mapping. prefix is used):

insert overwrite directory '/json/'
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
 "case.insensitive" = "false",
 "mapping._col0" = "stat_name",
 "mapping._col1" = "stat_interval"
)
select name, nickname
from tb_test limit 100
Shervin