
I am trying to convert a DataFrame to a Dataset. The Java class structure is as follows:

class A:

public class A {

    private int a;

    public int getA() {
        return a;
    }

    public void setA(int a) {
        this.a = a;
    }
}

class B:

public class B extends A {

    private int b;

    public int getB() {
        return b;
    }

    public void setB(int b) {
        this.b = b;
    }
}

and class C:

public class C {

    private A a;

    public A getA() {
        return a;
    }

    public void setA(A a) {
        this.a = a;
    }
}

The data in the DataFrame is as follows:

+-----+
|  a  |
+-----+
|[1,2]|
+-----+

When I apply Encoders.bean[C](classOf[C]) to the DataFrame, the A reference held by C, which should be an instance of B, is not one: checking it with .isInstanceOf[B] returns false. The output of the Dataset is as follows:

+-----+
|  a  |
+-----+
|[1,2]|
+-----+

How can I get all the fields of both A and B from the C object while iterating over the Dataset with foreach?

Code:

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

object TestApp extends App {

  implicit val sparkSession: SparkSession = SparkSession.builder()
    .appName("Test-App")
    .config("spark.sql.codegen.wholeStage", value = false)
    .master("local[1]")
    .getOrCreate()

  // Column "a" is an array of structs with integer fields "a" and "b"
  val schema = new StructType()
    .add("a", ArrayType(new StructType().add("a", IntegerType, true).add("b", IntegerType, true), true))

  val dd = sparkSession.read.schema(schema).json("Test.txt")

  // Convert the DataFrame to a Dataset[C] using the bean encoder
  val ff = dd.as(Encoders.bean[C](classOf[C]))
  ff.show(truncate = false)

  ff.foreach(f => {
    println(f.getA.get(0).isInstanceOf[A]) // true
    println(f.getA.get(0).isInstanceOf[B]) // false
  })
}

Content of the file Test.txt: {"a":[{"a":1,"b":2}]}

Chirag

1 Answer


Spark Catalyst uses reflection (Google's reflection utilities) to derive a schema from Java beans. Take a look at JavaTypeInference.scala#inferDataType: it uses the bean's getters to collect the field names, and the return type of each getter to compute the Spark type.

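Since class C has a getter getA() with return type A, and A in turn has a getter getA() with return type int, the schema is created as struct<a:struct<a:int>>, where struct<a:int> is derived from getA() of class A. The getters of B are never inspected, because the declared return type is A, so the decoded object is a plain A rather than a B.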
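The effect of the declared getter type can be reproduced without Spark. The sketch below is only an illustration using the JDK's java.beans.Introspector, on the assumption that Spark's bean inference sees the same getter-based properties; the class BeanSchemaSketch and the helper readableProperties are hypothetical names introduced here, not part of Spark.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.Map;
import java.util.TreeMap;

class BeanSchemaSketch {

    // Same shape as the classes in the question.
    public static class A {
        private int a;
        public int getA() { return a; }
        public void setA(int a) { this.a = a; }
    }

    public static class B extends A {
        private int b;
        public int getB() { return b; }
        public void setB(int b) { this.b = b; }
    }

    public static class C {
        private A a;
        public A getA() { return a; }
        public void setA(A a) { this.a = a; }
    }

    /**
     * Hypothetical helper: maps each readable bean property of cls
     * to the declared return type of its getter, sorted by name.
     */
    public static Map<String, Class<?>> readableProperties(Class<?> cls) {
        Map<String, Class<?>> props = new TreeMap<>();
        try {
            // Object.class as the stop class hides the "class" property from getClass().
            for (PropertyDescriptor pd
                    : Introspector.getBeanInfo(cls, Object.class).getPropertyDescriptors()) {
                if (pd.getReadMethod() != null) {
                    props.put(pd.getName(), pd.getReadMethod().getReturnType());
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // C exposes a single property "a" whose declared type is A, not B.
        System.out.println(readableProperties(C.class).keySet());
        System.out.println(readableProperties(C.class).get("a").getSimpleName());
        // Walking into A finds only "a"; "b" appears only when B itself is inspected.
        System.out.println(readableProperties(A.class).keySet());
        System.out.println(readableProperties(B.class).keySet());
    }
}
```

Nothing in the declared types leads from C to B, which is consistent with the Dataset rows being rebuilt as plain A instances.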

One solution I can think of:

// Modify class C to reference the concrete class B rather than its supertype A
public class C {

    private B a;

    public B getA() {
        return a;
    }

    public void setA(B a) {
        this.a = a;
    }
}

Output:

root
 |-- a: struct (nullable = true)
 |    |-- a: integer (nullable = false)
 |    |-- b: integer (nullable = false)

+------+
|a     |
+------+
|[1, 2]|
+------+
Som
  • There are 10-15 classes that extend class A. If I follow this approach, I will have to maintain that many variations of class C, which I want to avoid. – Chirag May 30 '20 at 09:39
  • Why can't you create a static schema containing all the fields of A's subclasses, rather than depending on the schema created by reflection? – Som May 30 '20 at 09:49
  • Can you give a sample? – Chirag May 30 '20 at 12:42
  • Can you tell me the use case? Basically, I want to know the input and the expected output. – Som May 30 '20 at 16:38