3

Given a simple Scala case class like this:

package com.foo.storage.schema   
case class Person(name: String, age: Int)

it's possible to create a Spark schema from a case class as follows:

import org.apache.spark.sql._
import com.foo.storage.schema.Person  

val schema = Encoders.product[Person].schema

I wonder if it's possible to access the schema from a case class in Python/PySpark. I would hope to do something like this [Python]:

jvm = sc._jvm
py4j_class = jvm.com.foo.storage.schema.Person 
jvm.org.apache.spark.sql.Encoders.product(py4j_class)

This throws an error com.foo.storage.schema.Person._get_object_id does not exist in the JVM. The Encoders.product is a generic in Scala, and I'm not entirely sure how to specify the type using Py4J. Is there a way to use the case class to create a PySpark schema?

RvdV
  • 406
  • 3
  • 10

1 Answers1

1

I've found there's no clean / easy way to do this using generics, also not as a pure Scala function. What I ended up doing is making a companion object for the case class that can fetch the schema.

Solution

package com.foo.storage.schema
case class Person(name: String, age: Int)
object Person {
  def getSchema = Encoders.product[Person].schema
}

This function can be called from Py4J, but will return a JavaObject. It can be converted with a helper function like this:

from pyspark.sql.types import StructType
import json
def java_schema_to_python(j_schema):
  json_schema = json.loads(ddl.json())
  return StructType.fromJson(json_schema)

Finally, we can extract our schema:

j_schema = jvm.com.foo.storage.Person.getSchema()
java_schema_to_python(j_schema)

Alternative solution

I found there is one more way to do this, but I like the first one better. You can make a generic function that infers the type of the argument in Scala, and uses that to infer the type:

object SchemaConverter {
  def getSchemaFromType[T <: Product: TypeTag](obj: T): StructType = {
     Encoders.product[T].schema
  }
}

Which can be called like this:

val schema = SchemaConverter.getSchemaFromType(Person("Joe", 42))

I didn't like this method since it requires you to create a dummy instance of the case class. Haven't tested it, but I think the function above could be called using Py4J too.

RvdV
  • 406
  • 3
  • 10