Try using something like this:
sqlContext.read
  .schema(ScalaReflection.schemaFor[Some_Case_Class].dataType.asInstanceOf[StructType])
  .json(some_json)
Your case classes can look like this:
case class Banner(w: Int, h: Int, pos: Int)
case class Imp(id: Long, banner: List[Banner])
case class DataSet(opp_id: Long, id: Long, date: Long, imp: Imp)
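For reference, a single input record matching these case classes might look like the line below (the field values are invented purely for illustration), shown here as a Scala string you could test with in the shell:

val sampleJson = """{"opp_id":1,"id":100,"date":1451606400,"imp":{"id":5,"banner":[{"w":300,"h":250,"pos":1}]}}"""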
Your reflection could be something like this:
ScalaReflection.schemaFor[DataSet].dataType.asInstanceOf[StructType]
You need to have:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
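Putting the pieces together in the shell, a minimal sketch might look like this (assuming sqlContext is in scope, as it is in the Spark shell, and that "some.json" is just a placeholder for wherever your data actually lives):

// With the imports and case classes above in scope, derive the schema once
val schema = ScalaReflection.schemaFor[DataSet].dataType.asInstanceOf[StructType]

// Apply it when reading so Spark does not have to infer the schema from the data
val df = sqlContext.read.schema(schema).json("some.json")
df.printSchema()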
I get:
res7: org.apache.spark.sql.types.StructType = StructType(StructField(opp_id,LongType,false), StructField(id,LongType,false), StructField(date,LongType,false), StructField(imp,StructType(StructField(id,LongType,false), StructField(banner,ArrayType(StructType(StructField(w,IntegerType,false), StructField(h,IntegerType,false), StructField(pos,IntegerType,false)),true),true)),true))
If you were typing this schema out by hand, you might miss one of the Scala Spark shell nuances - namely that a Seq (an array in the schema) needs to appear in certain places for the schema to capture the repeated data. To see more clearly where those places are, you can issue:
res7.printTreeString()
root
|-- opp_id: long (nullable = false)
|-- id: long (nullable = false)
|-- date: long (nullable = false)
|-- imp: struct (nullable = true)
| |-- id: long (nullable = false)
| |-- banner: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- w: integer (nullable = false)
| | | |-- h: integer (nullable = false)
| | | |-- pos: integer (nullable = false)
Since there are other questions and answers covering that Seq part, I'll defer the rest - you now have the tools you probably need.