
Let's say I have some JSON data in the format:

{
    "opp_id": "IxexyLDIIk",
    "id": "IxexyLDIIk",
    "date": 1488465636,
    "imp": {
        "id": "1",
        "banner": [{
            "w": 728,
            "h": 90,
            "pos": 1
        }]
    }
}

I would like to create a schema in which the field imp, which is a map, can accept any number of values. The issue I see is that I cannot do that, because inside imp there is a field, banner, which is an array.

How can one create such a schema in Spark? That is, how do I specify a map that can have any keys, where some of those keys have a specific schema?

Ideally I would like a solution in JSON schema, but Scala / PySpark is fine.

To clarify, I would like to know if it's possible to do:

df = spark.read.json(data, schema=THIS_IS_WHAT_I_NEED)
Manuel G

1 Answer


Try using something like this:

sqlContext.read
  .schema(
    ScalaReflection.schemaFor[Some_Case_Class].dataType.asInstanceOf[StructType]
  )
  .json(some_json)

Your case classes can look like this (note that opp_id, id, and imp.id are strings in your sample, and the field name must be lowercase banner to match the JSON key):

case class Banner(w: Int, h: Int, pos: Int)
case class Imp(id: String, banner: List[Banner])
case class DataSet(opp_id: String, id: String, date: Long, imp: Imp)

Your reflection could be something like this:

ScalaReflection.schemaFor[DataSet].dataType.asInstanceOf[StructType]

You need to have:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
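
Putting the pieces together for the read you asked about, a minimal sketch (the "data.json" path is hypothetical):

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Build the schema once via reflection, then hand it to the JSON reader.
val schema = ScalaReflection.schemaFor[DataSet].dataType.asInstanceOf[StructType]
val df = sqlContext.read.schema(schema).json("data.json")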

For the schema itself, I get:

res7: org.apache.spark.sql.types.StructType = StructType(StructField(opp_id,StringType,true), StructField(id,StringType,true), StructField(date,LongType,false), StructField(imp,StructType(StructField(id,StringType,true), StructField(banner,ArrayType(StructType(StructField(w,IntegerType,false), StructField(h,IntegerType,false), StructField(pos,IntegerType,false)),true),true)),true))

If you type this schema into your code by hand, there is a Scala Spark shell nuance the output above hides: StructType actually takes a Seq of StructFields, so Seq needs to be inserted in certain places (see the hand-written sketch further below). To see where the nesting lies, you can issue:

res7.printTreeString
root
 |-- opp_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- date: long (nullable = false)
 |-- imp: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- banner: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- w: integer (nullable = false)
 |    |    |    |-- h: integer (nullable = false)
 |    |    |    |-- pos: integer (nullable = false)

Since there are other questions and answers covering this Seq detail, I'll defer the rest; you now have the tools you probably need.
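
For reference, a hand-written equivalent of the reflected schema might look like the sketch below; the Seq wrappers are exactly the part the toString output hides:

import org.apache.spark.sql.types._

// Hand-built version of the schema that schemaFor[DataSet] produces.
// StructType takes a Seq[StructField], hence the Seq(...) wrappers.
val schema = StructType(Seq(
  StructField("opp_id", StringType, nullable = true),
  StructField("id", StringType, nullable = true),
  StructField("date", LongType, nullable = false),
  StructField("imp", StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("banner", ArrayType(StructType(Seq(
      StructField("w", IntegerType, nullable = false),
      StructField("h", IntegerType, nullable = false),
      StructField("pos", IntegerType, nullable = false)
    )), containsNull = true), nullable = true)
  )), nullable = true)
))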

codeaperature
  • Also - There is another similar question / answer that @zero323 answered at [link](http://stackoverflow.com/questions/39083873/spark-2-0-0-reading-json-data-with-variable-schema?rq=1) – codeaperature Mar 17 '17 at 07:21
  • @codeaperature That link doesn't really answer what I am looking for, which is a way of defining schemas that allow structs with any number of keys as text (so Map[String, String]), and certain specified keys with a different schema. – Manuel G Mar 18 '17 at 18:07
  • How about `case class Test(m : Map[String, String])` with `ScalaReflection.schemaFor[Test].dataType.asInstanceOf[StructType]` to get `StructType(StructField(m,MapType(StringType,StringType,true),true))` ? – codeaperature Mar 19 '17 at 02:41
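
To make that last suggestion concrete, a minimal sketch of the Map-based schema; note that every value in a Spark MapType must share a single type (StringType here), which is why a fully free-form imp and a typed banner array cannot live in the same MapType:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Reflecting a Map field yields a MapType that accepts any keys,
// but all values are constrained to one type (String).
case class Test(m: Map[String, String])

val mapSchema = ScalaReflection.schemaFor[Test].dataType.asInstanceOf[StructType]
// mapSchema: StructType(StructField(m,MapType(StringType,StringType,true),true))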