
I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input JSON data could be very large, with more than 1000 keys per row, and one batch could be more than 20 GB. The entire batch is generated from 30 data sources; the 'key2' of each JSON record identifies the source, and the structure for each source is predefined.

What would be the best approach for processing such data? I have tried using from_json as shown below, but it works only with a fixed schema, and to use it I would first need to group the data by source and then apply the schema. Because of the data volume, my preferred choice is to scan the data only once and extract the required values from each source based on its predefined schema.

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val data = sc.parallelize(
    """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    :: Nil)
val df = data.toDF


val schema = (new StructType)
  .add("key1", StringType)
  .add("key2", StringType)
  .add("key3", (new StructType)
    .add("key3_k1", StringType))


df.select(from_json($"value",schema).as("json_str"))
  .select($"json_str.key3.key3_k1").collect
res17: Array[org.apache.spark.sql.Row] = Array([xxx])
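
For illustration, a minimal sketch of such a single-pass, per-source approach could look like the following. Here schemasBySource is a hypothetical map of the predefined schemas keyed by the key2 value, with only the schema defined above filled in; this is a sketch, not the accepted solution.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{from_json, get_json_object, when}
import org.apache.spark.sql.types.StructType

// Hypothetical map: one predefined schema per source, keyed by the value of key2.
val schemasBySource: Map[String, StructType] = Map(
  "source1" -> schema
  // ... the remaining sources and their predefined schemas
)

// One parsed column per source; only the column whose key2 matches is non-null,
// so the dataset is read in a single pass.
val parsedCols: Seq[Column] = schemasBySource.map { case (source, s) =>
  when(get_json_object($"value", "$.key2") === source, from_json($"value", s))
    .alias(s"parsed_$source")
}.toSeq

val parsed = df.select(parsedCols: _*)
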
Syntax
  • This is a thought, rather than a fully formed answer. Do you mean that all your schemas share some common keys and those are the ones you care about? [Argonaut's](http://argonaut.io/) `Json` objects are serializable, so you could start by making an `RDD[Json]` and then pick out the relevant fields. (Not all the Argonaut objects are serializable, so you'd have to be careful. Or pick a different JSON library.) – hoyland Mar 03 '18 at 23:21
  • There are a few common keys (key1, key2 in the example), but I care about the variable keys. The common keys hold information about the JSON, like timestamp, source, etc., and the variable part has the actual information. – Syntax Mar 04 '18 at 01:45
  • What is your desired output? One dataframe whose columns are the union of all the relevant keys (and null otherwise)? One dataframe for each schema? I'm still thinking you'll need to use some JSON library. You'd have to handle 30 separate cases, but I don't see how you can avoid that in any circumstance--you have 30 different kinds of data. – hoyland Mar 04 '18 at 12:14
  • Yes, I am doing a 'union all' of 30 DFs. For each DF I am extracting the required key-values into a string column. I tried the 'json-lenses' library in a UDF and it works, but considering future support and Spark's built-in optimizations I wanted to try Spark's JSON functions. – Syntax Mar 04 '18 at 17:04
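
A minimal sketch of the per-source 'union all' approach mentioned in the comment above, using the df and schema from the question plus a hypothetical extractSource helper (in practice each of the 30 sources would get its own predefined schema):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{from_json, get_json_object, lit, to_json}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: keep only the rows of one source, parse them with that
// source's predefined schema, and serialize the extracted values into a string column.
def extractSource(raw: DataFrame, source: String, sourceSchema: StructType): DataFrame =
  raw.filter(get_json_object($"value", "$.key2") === source)
     .select(lit(source).as("source"),
             to_json(from_json($"value", sourceSchema)).as("extracted"))

// Two sources shown for brevity; the real job would union all 30.
val combined = extractSource(df, "source1", schema)
  .union(extractSource(df, "source2", schema)) // source2 would use its own schema in practice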

3 Answers


This is just a restatement of @Ramesh Maharjan's answer, but with more modern Spark syntax.

I found this method lurking in DataFrameReader which allows you to parse JSON strings from a Dataset[String] into an arbitrary DataFrame and take advantage of the same schema inference Spark gives you with spark.read.json("filepath") when reading directly from a JSON file. The schema of each row can be completely different.

def json(jsonDataset: Dataset[String]): DataFrame

Example usage:

import spark.implicits._

val jsonStringDs = spark.createDataset[String](
  Seq(
    """{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}""",
    """{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}"""))

jsonStringDs.show

jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
+----------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------+
|{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
|{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}  |
+----------------------------------------------------------------------------------------------------------------------+


val df = spark.read.json(jsonStringDs)
df.show(false)

df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
+----------+------------------+-------------+---------+--------+------------+------+------------+
|CEO       |address           |employeeCount|firstname|lastname|marketCap   |name  |revenue     |
+----------+------------------+-------------+---------+--------+------------+------+------------+
|null      |[London,Baker,121]|null         |Sherlock |Holmes  |null        |null  |null        |
|Jeff Bezos|null              |500000       |null     |null    |817117000000|Amazon|177900000000|
+----------+------------------+-------------+---------+--------+------------+------+------------+
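
Once the schema has been inferred, nested fields can be selected with dot notation just as in the question; for example, using the columns of the inferred DataFrame above:

// Select a nested field and a top-level field from the inferred schema.
df.select("address.city", "CEO").show(false)
// Expected output (roughly):
// +------+----------+
// |city  |CEO       |
// +------+----------+
// |London|null      |
// |null  |Jeff Bezos|
// +------+----------+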

The method is available from Spark 2.2.0: http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader@json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame

Wade Jensen

If you have data as you mentioned in the question, such as

val data = sc.parallelize(
    """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
    :: Nil)

You don't need to create a schema for the JSON data. Spark SQL can infer the schema from the JSON string itself. You just have to use sqlContext.read.json as below:

val df = sqlContext.read.json(data)

which will give you the following schema for the RDD data used above:

root
 |-- key1: string (nullable = true)
 |-- key2: string (nullable = true)
 |-- key3: struct (nullable = true)
 |    |-- key3_k1: string (nullable = true)

And you can just select key3_k1 as

df.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+

You can manipulate the dataframe as you wish. I hope the answer is helpful.
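
For example, a small sketch of flattening the nested struct into top-level columns, using the inferred schema shown above:

// Expand every field of the key3 struct into its own top-level column.
df.select("key1", "key2", "key3.*").show(false)
// +----+-------+-------+
// |key1|key2   |key3_k1|
// +----+-------+-------+
// |val1|source1|key3_v1|
// +----+-------+-------+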

Ramesh Maharjan
  • There are two problems with this approach that I am facing: 1. Spark takes a long time to infer the schema because my input data is big. 2. Two sources can have the same key at the same level (in the dynamic part) but the value could be nested JSON, and in such cases Spark will only infer the schema up to the common key. – Syntax Mar 04 '18 at 06:18
  • @Syntax, you will have to figure out how to change the conflicting keys in your input file then. Sorry that I can't be of more help than that. – Ramesh Maharjan Mar 04 '18 at 10:21
  • Thanks Ramesh for your suggestion. Unfortunately I can't change source data. – Syntax Mar 04 '18 at 17:08
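
Regarding the slow schema inference raised in the first comment above, one possible (untested here) mitigation is the JSON reader's samplingRatio option, which infers the schema from only a fraction of the input instead of the full batch:

// Infer the schema from roughly 10% of the input rows; 0.1 is just an example
// value and "your_input_path" is a placeholder.
val sampled = spark.read
  .option("samplingRatio", "0.1")
  .json("your_input_path")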

I am not sure whether my suggestion can help you, although I had a similar case and solved it as follows:

1) The idea is to use Rapture JSON (or some other JSON library) to load the JSON schema dynamically. For instance, you could read the first row of the JSON file to discover the schema (similar to what I do below with jsonSchema).

2) Generate the schema dynamically. First iterate through the dynamic fields (notice that I project the values of key3 as Map[String, String]) and add a StructField for each of them to the schema.

3) Apply the generated schema to your dataframe.

import rapture.json._
import jsonBackends.jackson._

val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
val json = Json.parse(jsonSchema)

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.{StringType, StructType}

val schema = ArrayBuffer[StructField]()
//we could do this dynamic as well with json rapture
schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))

val items = ArrayBuffer[StructField]()
json.key3.as[Map[String, String]].foreach{
  case(k, v) => {
    items.append(StructField(k, StringType))
  }
}
val complexColumn =  new StructType(items.toArray)
schema.append(StructField("key3", complexColumn))

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")

val spark = SparkSession.builder().config(sparkConf).getOrCreate()

val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")

jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()

I used the following data as input:

{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
{"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
{"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}

And the output:

+----+-------+--------+--------+--------+
|key1|   key2| key3_k1| key3_k2| key3_k3|
+----+-------+--------+--------+--------+
|val1|source1|key3_v11|key3_v21|key3_v31|
|val2|source2|key3_v12|key3_v22|key3_v32|
|val3|source3|key3_v13|key3_v23|key3_v33|
+----+-------+--------+--------+--------+

An advanced alternative, which I haven't tested yet, would be to generate a case class, e.g. called JsonRow, from the JSON schema in order to have a strongly typed dataset. This provides better serialization performance as well as making your code more maintainable. To make this work, you first need to create a JsonRow.scala file, then implement an sbt pre-build script which modifies the content of JsonRow.scala (you might have more than one, of course) dynamically, based on your source files. To generate the JsonRow class dynamically you can use the following code:

def generateClass(members: Map[String, String], name: String): String = {
  // emit every member as a String field
  val classMembers = for (m <- members) yield {
    s"${m._1}: String"
  }

  val classDef = s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
  classDef
}

The method generateClass accepts a map of strings used to create the class members, plus the class name itself. The members of the generated class can again be populated from your JSON schema:

import org.codehaus.jackson.node.{ObjectNode, TextNode}
import collection.JavaConversions._

val mapping = collection.mutable.Map[String, String]()
val fields = json.$root.value.asInstanceOf[ObjectNode].getFields

for (f <- fields) {
  (f.getKey, f.getValue) match {
    case (k: String, v: TextNode) => mapping(k) = v.asText
    case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
    case _ => None
  }
}

val dynClass = generateClass(mapping.toMap, "JsonRow")
println(dynClass)

This prints out:

case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass

Good luck

abiratsis