7

I'm trying JSON-SerDe from below link http://code.google.com/p/hive-json-serde/wiki/GettingStarted.

         CREATE TABLE my_table (field1 string, field2 int, 
                                     field3 string, field4 double)
         ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' ;

I've added Json-SerDe jar as

          ADD JAR /path-to/hive-json-serde.jar;

And loaded data as

LOAD DATA LOCAL INPATH  '/home/hduser/pradi/Test.json' INTO TABLE my_table;

and it loads data successfully.

But when query data as

Select * from my_table;

I get only one row from table as

data1 100 more data1 123.001

Test.json contains

{"field1":"data1","field2":100,"field3":"more data1","field4":123.001} 

{"field1":"data2","field2":200,"field3":"more data2","field4":123.002} 

{"field1":"data3","field2":300,"field3":"more data3","field4":123.003} 

{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}

Where is the problem? why only one row is coming instead of 4 rows when i query the table. And in /user/hive/warehouse/my_table contains all the 4 rows!!


hive> add jar /home/hduser/pradeep/hive-json-serde-0.2.jar;
Added /home/hduser/pradeep/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradeep/hive-json-serde-0.2.jar

hive> CREATE EXTERNAL TABLE my_table (field1 string, field2 int,
>                                 field3 string, field4 double)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
> WITH SERDEPROPERTIES (
>   "field1"="$.field1",
>   "field2"="$.field2",
>   "field3"="$.field3",
>   "field4"="$.field4"
> );
OK
Time taken: 0.088 seconds

hive> LOAD DATA LOCAL INPATH  '/home/hduser/pradi/test.json' INTO TABLE my_table;
Copying data from file:/home/hduser/pradi/test.json
Copying file: file:/home/hduser/pradi/test.json
Loading data to table default.my_table
OK
Time taken: 0.426 seconds

hive> select * from my_table;
OK
data1   100     more data1      123.001
Time taken: 0.17 seconds

I've already posted the contents of test.json file. so you can see that query is resulting only one line as

data1   100     more data1      123.001

I've changed the json file to employee.json which contains

{ "firstName" : "Mike", "lastName" : "Chepesky", "employeeNumber" : 1840192 }

and changed table also but it showing a null values when i query the table

hive> add jar /home/hduser/pradi/hive-json-serde-0.2.jar;
Added /home/hduser/pradi/hive-json-serde-0.2.jar to class path
Added resource: /home/hduser/pradi/hive-json-serde-0.2.jar

hive> create EXTERNAL table employees_json (firstName string, lastName string,        employeeNumber int )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
OK
Time taken: 0.297 seconds


hive> load data local inpath '/home/hduser/pradi/employees.json' into table     employees_json;
Copying data from file:/home/hduser/pradi/employees.json
Copying file: file:/home/hduser/pradi/employees.json
Loading data to table default.employees_json
OK
Time taken: 0.293 seconds


 hive>select * from employees_json;
  OK
  NULL    NULL    NULL
  NULL    NULL    NULL
  NULL    NULL    NULL
  NULL    NULL    NULL
  NULL    NULL    NULL
  NULL    NULL    NULL
Time taken: 0.194 seconds
pramav
  • 473
  • 3
  • 11
  • 20

4 Answers4

3

A bit hard to tell what's going on without the logs (see Getting Started) in case of doubt. Just a quick thought - can you try if it works with WITH SERDEPROPERTIESas so:

CREATE EXTERNAL TABLE my_table (field1 string, field2 int, 
                                field3 string, field4 double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES (
  "field1"="$.field1",
  "field2"="$.field2",
  "field3"="$.field3",
  "field4"="$.field4" 
);

There is also a fork you might want to give a try from ThinkBigAnalytics.

UPDATE: Turns out the input in Test.json is invalid JSON hence the records get collapsed.

See answer https://stackoverflow.com/a/11707993/396567 for further details.

Community
  • 1
  • 1
Michael Hausenblas
  • 13,162
  • 4
  • 52
  • 66
  • Hi Michael, same problem even after creating table as you said WITH SERDEPROPERTIES. I checked logs but couldnt make out anything from those logs. – pramav Feb 06 '13 at 05:31
  • yeah same thing which u've written above. But what's the difference in Create table and Create external table? And also when i drop this table(CREATE EXTERNAL), it is not deleted from HDFS. – pramav Feb 06 '13 at 09:24
  • 1
    EXTERNAL means Hive does not own the data but the metadata for the table is deleted on DROP. Can you update your post with the full output of your query pls? – Michael Hausenblas Feb 06 '13 at 09:35
  • Hi Michael, I've posted the the full output of the query. – pramav Feb 06 '13 at 10:04
  • 1
    OMG, I focused so much on Hive to only now figure that your input is invalid. Can't repeat keys in JSON and that's why int he load phase the records are collapsed. Check with linters such as http://jsonlint.com/ first to make sure it's valid JSON. – Michael Hausenblas Feb 06 '13 at 10:13
  • If we cant repeat multiple keys then how can we mention multiple values for single key like for "firstname" how to mention 10 different names – pramav Feb 06 '13 at 10:54
  • Hi Michael, I've changed the table and json file, this time it is valid json file. but after queriying the table i'm getting null values. I've posted the full output of the query – pramav Feb 07 '13 at 10:09
  • Got it running with https://github.com/rcongiu/Hive-JSON-Serde (need to build the jar first from there). – Michael Hausenblas Feb 07 '13 at 13:54
  • I've taken from here http://code.google.com/p/hive-json-serde/downloads/list. Do i've to build this also? – pramav Feb 08 '13 at 05:07
  • No, there the jars are already available - try the one from GitHub (all you need is maven). – Michael Hausenblas Feb 08 '13 at 09:18
  • You mean to say that jar from code.google.com/p/hive-json-serde/downloads/list. is older version thats why i'm getting null values. I never used maven. The jar from github is the correct one. – pramav Feb 08 '13 at 09:52
0
  1. First of all you have to validate your json file on http://jsonlint.com/ after that make your file as one row per line and remove the [ ]. the comma at the end of the line is mandatory.

    [{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}, {"field1":"data2","field2":200,"field3":"more data2","field4":123.002}, {"field1":"data3","field2":300,"field3":"more data3","field4":123.003}, {"field1":"data4","field2":400,"field3":"more data4","field4":123.004}]

  2. In my test I added hive-json-serde-0.2.jar from hadoop cluster , I think hive-json-serde-0.1.jar should be ok.

    ADD JAR hive-json-serde-0.2.jar;

  3. Create your table

    CREATE TABLE my_table (field1 string, field2 int, field3 string, field4 double) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' ;

  4. Load your Json data file ,here I load it from hadoop cluster not from local

    LOAD DATA INPATH 'Test2.json' INTO TABLE my_table;

My test

Salem
  • 1
  • 1
0

for json parsing based on cwiki/confluence we need follow some steps

  1. need to download hive-hcatalog-core.jar

  2. hive> add jar /path/hive-hcatalog-core.jar

  3. create table tablename(colname1 datatype,.....) row formatserde'org.apache.hive.hcatalog.data.JsonSerDe' stored as ORCFILE;

  4. colname in creating table and colname in test.json must be same if not it will show null values Hope it wil helpfull

Jagadeesh
  • 11
  • 2
0

I solved similar problem -

  1. I took the jar from - [http://www.congiu.net/hive-json-serde/1.3.8/hdp23/json-serde-1.3.8-jar-with-dependencies.jar]

  2. Run the command in Hive CLI - add jar /path/to/jar

  3. Created table using -
create table messages (
    id int,
    creation_date string,
    text string,
    loggedInUser STRUCT<id:INT, name: STRING>
)
row format serde "org.openx.data.jsonserde.JsonSerDe";
  1. This is my JSON data -
{"id": 1,"creation_date": "2020-03-01","text": "I am on cotroller","loggedInUser":{"id":1,"name":"API"}}
{"id": 2,"creation_date": "2020-04-01","text": "I am on service","loggedInUser":{"id":1,"name":"API"}}
  1. Loaded data in table using -
LOAD DATA LOCAL INPATH '${env:HOME}/path-to-json'
OVERWRITE INTO TABLE messages;
  1. select * from messages;
1   2020-03-01    I am on cotroller   {"id":1,"name:"API"}
2   2020-04-01    I am on service     {"id":1,"name:"API"}
Michael Heil
  • 16,250
  • 3
  • 42
  • 77