
I have the following case class:

case class Data[T](field1: String, field2: T)

I'm using the Kryo serializer with the following implicits for it:

implicit def single[A](implicit c: ClassTag[A]): Encoder[A] = Encoders.kryo[A](c)

implicit def tuple2[A1, A2](implicit e1: Encoder[A1], e2: Encoder[A2]): Encoder[(A1, A2)] =
        Encoders.tuple[A1, A2](e1, e2)

...

And I tried to perform the following join:

val ds1 = someDataframe1.as[(String, T)].map(row => Data(row._1, row._2))
val ds2 = someDataframe2.as[(String, T)].map(row => Data(row._1, row._2))
ds1.joinWith(ds2, col("field1") === col("field1"), "left_outer")

After that I got the following exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'field1' given input columns: [value, value];

What happened to the column names in my datasets?

UPD: when I called ds1.schema I got the following output:

StructField(name = value,dataType = BinaryType, nullable = true)

I think the problem is with Kryo serialization: there is no schema metadata, so the case class is serialized as a single binary blob without a field name. I also noticed that everything works fine when T is a class known to Kryo (Int, String) or a case class, but when T is some Java bean, the schema of my Data dataset becomes a single unnamed blob field.
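
A quick way to confirm this is to look at the encoder's schema directly (using Data[Int] purely as an example type parameter; the result is the same for any T):

import org.apache.spark.sql.Encoders

// Kryo serializes the whole object into a single BinaryType column
// named "value", so there are no field1/field2 columns to resolve
println(Encoders.kryo[Data[Int]].schema)
// StructType(StructField(value,BinaryType,true))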

Spark version 1.6.1

  • I am getting a different error message (Spark version?). Anyway, have you tried `col("_1.field1") === col("_2.field2")` as the join condition? It works for me this way. – Beryllium Sep 19 '16 at 14:05
  • @Beryllium check the updated question, please – Cortwave Sep 19 '16 at 14:35

1 Answer


You just created a dataset with a single column of type Data, so you cannot access the fields (such as field1) inside that object.

dataframe: |value|
           |-----|
           |Data |
           |-----|
           |Data |
           |-----|
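
You can confirm this by listing the columns of the underlying DataFrame (a minimal check, reusing ds1 from the question):

// Only one column exists, so col("field1") has nothing to resolve against
println(ds1.toDF().columns.mkString(", "))  // prints: value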

You may try this to convert your DataFrame to a Dataset (with a schema-aware encoder for Data[T] in scope; see the sketch after the table below): val ds1 = someDataframe1.as[Data[T]]

dataframe: |field1|field2|
           |------|------|
           |String|  T   |
           |------|------|
           |String|  T   |
           |------|------|
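
A minimal end-to-end sketch of this approach, assuming a concrete type parameter (Int, purely for illustration) and that the schema-aware product encoder from sqlContext.implicits._ is in scope instead of the Kryo one:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

val sqlContext: SQLContext = ???  // your existing SQLContext
import sqlContext.implicits._     // product encoder for the Data case class;
                                  // make sure the Kryo implicit does not shadow it

// With a concrete T, Spark derives real field1/field2 columns
val ds1 = someDataframe1.as[Data[Int]].as("a")
val ds2 = someDataframe2.as[Data[Int]].as("b")

// field1 now resolves, and the aliases disambiguate the two sides
val joined = ds1.joinWith(ds2, col("a.field1") === col("b.field1"), "left_outer")

Here joined is a Dataset[(Data[Int], Data[Int])], with null on the right side for unmatched rows.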

Or, if you still want to keep your single-column dataset, change the join condition to reference the nested field:

ds1.joinWith(ds2, ds1.toDF().col("value.field1") === ds2.toDF().col("value.field1"), "left_outer")

Note that this only resolves when value is a struct column (i.e. the dataset was built with a schema-aware encoder); with the Kryo encoder from your question, value is BinaryType, so there are no nested fields to reference.