1

A quick question on the JSON handling in PIG.

I tried some JsonLoader called Elephant-Bird to load and handle JSON data like the followings:

{
   "SV":1,
   "AD":[
      {
         "ID":"46931606",
         "C1":"46",
         "C2":"469",
         "ST":"46931",
         "PO":1
      },
      {
         "ID":"46721489",
         "C1":"46",
         "C2":"467",
         "ST":"46721",
         "PO":5
      }
   ]
}

The loader works well for simple fields but it doesn't work well for any array field. I don't know how I can access elements in the array ("AD" field above) with this UDF or in any other way? Please advise.

Sirko
  • 72,589
  • 19
  • 149
  • 183
kee
  • 10,969
  • 24
  • 107
  • 168
  • I created my own UDF to handle this case specifically (not in generic fashion though) in case other ppl encounter similar need. – kee Mar 28 '12 at 01:55
  • Have you tried doing it with dots or hashes? I remember using one of those to access stuff in hashes from out of a json array loaded through Elephant bird. – Eli Mar 29 '12 at 20:14

3 Answers3

3

You should use -nestedLoad param like this:

a = load 'input' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]).

And then you use the following code:

b = FOREACH a GENERATE (json#'AD') as AD:bag{t:Tuple(m:map[])};

Then your json array become a bag datatype. You can flatten it to get tuple.

c = FOREACH b GENERATE FLATTEN(AD);
d = FOREACH c GENERATE AD::m#ID AS ID, AD::m#C1 AS C1, AD::m#C2 AS C2, AD::m#ST AS ST, AD::m#PO AS PO

At this time, you will get the tuple data type which the schema is (ID:bytearray, C)

0

I think the array part gets returned as a bag with the elephant bird JsonLoader.

I don't know if it is the right way to do it, but a work around could be to flatten AD - drawback being you will now have multiple rows corresponding to each element in the array AD.

Stephan
  • 41,764
  • 65
  • 238
  • 329
dejavu013
  • 172
  • 7
0

As the others have commented, during loading array becomes a bag which you need to flatten. To make it self-containing, here is an example of the entire script:

REGISTER elephant-bird-core-4.3.jar;
REGISTER elephant-bird-hadoop-compat-4.5.jar;
REGISTER elephant-bird-pig-4.5.jar;

DEFINE JsonLoader com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');

records = LOAD '$DATA_PATH' USING JsonLoader() AS (data: map[]);
records = FOREACH records GENERATE 
                                data#'SV' AS SV,
                                FLATTEN(data#'AD') AS AD;
records = FOREACH records GENERATE
                                SV,
                                AD#'ID' AS ID,
                                AD#'C1' AS C1,
                                AD#'C2' AS C2,
                                AD#'ST' AS ST,
                                AD#'PO' AS PO;
djurikom
  • 109
  • 1
  • 5