
I'm very new to Hadoop.

I'm using Spark with Java.

I have dynamic JSON, for example:

    {
        "sourceCode":"1234",
        "uuid":"df123-....",
        "title":"my title"
    }
    {
        "myMetaDataEvent": {
            "date":"10/10/2010"
        },
        "myDataEvent": {
            "field1": {
                "field1Format":"fieldFormat",
                "type":"Text",
                "value":"field text"
            }
        }
    }

Sometimes only field1 is present, and sometimes field1 through field50 appear.

Users may also add fields to or remove fields from this JSON.

I want to insert this dynamic JSON into Hadoop (into a Hive table) from Spark Java code.

How can I do it?

Afterwards, users should be able to run Hive queries, e.g.: select * from MyTable where type="Text"

I have around 100B JSON records per day that I need to insert into Hadoop.

So what is the recommended way to do that?

*I've looked at the following: SO Question, but that deals with a known JSON schema, which isn't my case.

Thanks

Ya Ko

1 Answer


I encountered a similar problem and was able to resolve it this way (so this might help if you create the schema before you parse the JSON).

For a field with a string data type, you can create the schema field as:

StructField field = DataTypes.createStructField(<name of the field>, DataTypes.StringType, true);

For a field with an int data type, you can create the schema field as:

StructField field = DataTypes.createStructField(<name of the field>, DataTypes.IntegerType, true);

After you have added all the fields to a List<StructField>, e.g.:

List<StructField> innerField = new ArrayList<StructField>();
// ... field-adding logic ...
innerField.add(field1);
innerField.add(field2);

A value may occur once, or multiple instances may come in an array; in that case it needs to be wrapped in an ArrayType:

ArrayType getArrayInnerType = DataTypes.createArrayType(DataTypes.createStructType(innerField));

StructField getArrayField = DataTypes.createStructField(<name of field>, getArrayInnerType,true);

You can then create the top-level schema. Note that createStructType takes a list of fields rather than a single StructField:

StructType structuredSchema = DataTypes.createStructType(Arrays.asList(getArrayField));

Then read the JSON using the generated schema via the Dataset API:

Dataset<Row> dataRead = sqlContext.read().schema(structuredSchema).json(fileName);
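
Putting the pieces together, here is a minimal end-to-end sketch. It builds a schema matching the event JSON from the question, reads the file, and persists it to Hive with saveAsTable. The input path "input/events.json" and table name "MyTable" are placeholders, and Hive support must be enabled on the SparkSession for saveAsTable to write to Hive:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DynamicJsonToHive {
    public static void main(String[] args) {
        // Hive support is required for saveAsTable to write managed Hive tables.
        SparkSession spark = SparkSession.builder()
                .appName("dynamic-json-to-hive")
                .enableHiveSupport()
                .getOrCreate();

        // Inner struct for one "fieldN" entry, following the JSON in the question.
        List<StructField> innerField = Arrays.asList(
                DataTypes.createStructField("field1Format", DataTypes.StringType, true),
                DataTypes.createStructField("type", DataTypes.StringType, true),
                DataTypes.createStructField("value", DataTypes.StringType, true));

        StructField field1 = DataTypes.createStructField(
                "field1", DataTypes.createStructType(innerField), true);

        StructField myDataEvent = DataTypes.createStructField(
                "myDataEvent", DataTypes.createStructType(Arrays.asList(field1)), true);

        // Top-level schema: createStructType takes a list of fields.
        StructType structuredSchema = DataTypes.createStructType(
                Arrays.asList(myDataEvent));

        // "input/events.json" is a placeholder path.
        Dataset<Row> dataRead = spark.read()
                .schema(structuredSchema)
                .json("input/events.json");

        // Persist to Hive; the table's columns come from the dataset's schema.
        dataRead.write().mode(SaveMode.Append).saveAsTable("MyTable");
    }
}
```

Fields absent from a given record simply come out as null under this schema, which is how the "field1...field50" variability is absorbed.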
Deepan Ram
  • Hi, and after I read the data into a Dataset with the schema, do I use saveAsTable? Do I need to pre-define all the JSON fields in the Hive table? Thanks! – Ya Ko Jun 21 '18 at 13:57
  • Yes, once you have the data bound to the schema, you can apply any logic or function on the dataset. (You can use saveAsTable, and Hive tables can also have array as a data type in case you need to store arrays.) – Deepan Ram Jun 21 '18 at 15:08
  • Thanks, and is this the fastest way to do it? Because I have around 100B JSON records per day. – Ya Ko Jun 21 '18 at 15:25
  • Yes, it is one of the fastest ways I know of; there may be other, less time-consuming ways to parse it. – Deepan Ram Jun 22 '18 at 05:37
  • If it resolves your question, can you please accept my answer so it can help other people looking for similar information. – Deepan Ram Jun 22 '18 at 05:38
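
On the follow-up about querying afterwards: once the data is in a Hive table, nested struct fields can be addressed with dot notation in Spark SQL and Hive. A hedged sketch, assuming the table name "MyTable" and the field names from the JSON in the question:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class QueryNestedFields {
    public static void main(String[] args) {
        // Hive support is needed to query tables created via saveAsTable.
        SparkSession spark = SparkSession.builder()
                .appName("query-nested-fields")
                .enableHiveSupport()
                .getOrCreate();

        // Dot notation reaches into struct columns; the names here are
        // assumptions based on the example JSON in the question.
        Dataset<Row> textFields = spark.sql(
                "SELECT * FROM MyTable WHERE myDataEvent.field1.type = 'Text'");
        textFields.show();
    }
}
```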