I'm pretty new to Spark SQL. While implementing one of my training tasks I ran into the following issue and could not find an answer (the examples below are a bit contrived, but they should still be fine for demonstration purposes).
My app reads a parquet file and creates a dataset based on its content:
// read the parquet file into a DataFrame
DataFrame input = sqlContext.read().parquet("src/test/resources/integration/input/source.gz.parquet");
// wrap it as a Dataset<Row>, building a RowEncoder from the DataFrame's schema
Dataset<Row> dataset = input.as(RowEncoder$.MODULE$.apply(input.schema()));
The dataset.show() call results in:
+----------+------------+------+
|     Names|      Gender|   Age|
+----------+------------+------+
|Jack, Jill|Male, Female|30, 25|
+----------+------------+------+
Then I convert the dataset into a new dataset of Person objects:
public static Dataset<Person> transformToPerson(Dataset<Row> rawData) {
    return rawData
        .flatMap((Row sourceRow) -> {
            // code to parse an input row and split person data goes here
            Person person1 = new Person(name1, gender1, age1);
            Person person2 = new Person(name2, gender2, age2);
            return Arrays.asList(person1, person2);
        }, Encoders.bean(Person.class));
}
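For completeness, here is a sketch of what the parsing step might look like, assuming each of the three columns holds two comma-separated string values (the splitting logic is hypothetical and not part of my real code):

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

public static Dataset<Person> transformToPerson(Dataset<Row> rawData) {
    return rawData
        .flatMap((Row sourceRow) -> {
            // hypothetical parsing: each column contains two comma-separated values
            String[] names = sourceRow.getString(0).split(",\\s*");
            String[] genders = sourceRow.getString(1).split(",\\s*");
            String[] ages = sourceRow.getString(2).split(",\\s*");
            Person person1 = new Person(names[0], genders[0], ages[0]);
            Person person2 = new Person(names[1], genders[1], ages[1]);
            // the 1.6 Java flatMap expects an Iterable
            return Arrays.asList(person1, person2);
        }, Encoders.bean(Person.class));
}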
where
public abstract class Human implements Serializable {
    protected String name;
    protected String gender;
    // getters/setters go here
    // default constructor + constructor with the name and gender params
}

public class Person extends Human {
    private String age;
    // getters/setters for the age param go here
    // default constructor + constructor with the name, gender and age params
    // overridden toString() method which returns the string: (<name>, <gender>, <age>)
}
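Spelled out, the Person bean might look like this (a sketch following standard JavaBean conventions, which Encoders.bean(Person.class) relies on to derive the schema):

public class Person extends Human {
    private String age;

    // no-arg constructor, required for bean encoding
    public Person() {
    }

    public Person(String name, String gender, String age) {
        super(name, gender);
        this.age = age;
    }

    public String getAge() { return age; }
    public void setAge(String age) { this.age = age; }

    @Override
    public String toString() {
        return "(" + name + ", " + gender + ", " + age + ")";
    }
}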
Finally, when I show the dataset's content I expect to see
+----+------+---+
|name|gender|age|
+----+------+---+
|Jack|  Male| 30|
|Jill|Female| 25|
+----+------+---+
However, I see
+------------------+------+---+
|              name|gender|age|
+------------------+------+---+
|  (Jack, Male, 30)|      |   |
|(Jill, Female, 25)|      |   |
+------------------+------+---+
This is the result of the overridden toString() method, while the header is correct. I believe something is wrong with the Encoder, because if I use Encoders.javaSerialization(T) or Encoders.kryo(T) instead, it shows
+------------------+
|             value|
+------------------+
|  (Jack, Male, 30)|
|(Jill, Female, 25)|
+------------------+
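A single value column is at least consistent with those encoders, since they serialize the whole object into one binary field; printSchema() makes the difference visible (output abbreviated, and only a sketch of what I'd expect):

dataset.printSchema();
// with Encoders.bean(Person.class), one column per bean property:
// root
//  |-- age: string (nullable = true)
//  |-- gender: string (nullable = true)
//  |-- name: string (nullable = true)
//
// with Encoders.kryo(Person.class), a single binary column:
// root
//  |-- value: binary (nullable = true)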
What worries me most is that incorrect usage of encoders might lead to incorrect SerDe behavior and/or performance penalties, and I cannot see anything special in any of the Spark Java examples I can find. Could you please suggest what I am doing wrong?
UPDATE 1
Here are my project's dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
SOLUTION
As abaghel suggested, I upgraded the version to 2.0.2 (be aware that version 2.0.0 has a bug on Windows), used Dataset instead of DataFrame everywhere in my code (in the Java API, DataFrame was replaced by Dataset<Row> as of 2.0.0), and switched to the iterator-based flatMap function to transform from Row to Person.
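For reference, in Spark 2.x the Java FlatMapFunction returns an Iterator rather than an Iterable, and the lambda usually needs an explicit cast to FlatMapFunction to disambiguate it from the Scala overload. A sketch, reusing the same hypothetical parsing as above:

import java.util.Arrays;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

public static Dataset<Person> transformToPerson(Dataset<Row> rawData) {
    return rawData.flatMap((FlatMapFunction<Row, Person>) sourceRow -> {
        // same hypothetical comma-splitting as in the 1.6 sketch
        String[] names = sourceRow.getString(0).split(",\\s*");
        String[] genders = sourceRow.getString(1).split(",\\s*");
        String[] ages = sourceRow.getString(2).split(",\\s*");
        Person person1 = new Person(names[0], genders[0], ages[0]);
        Person person2 = new Person(names[1], genders[1], ages[1]);
        // 2.x expects an Iterator, not an Iterable
        return Arrays.asList(person1, person2).iterator();
    }, Encoders.bean(Person.class));
}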
Just to share: on version 1.6.2, by contrast, the TraversableOnce-based flatMap approach did not work for me, as it threw a 'MyPersonConversion$function1 not Serializable' exception.
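For completeness, in 2.x the read step goes through SparkSession and yields a Dataset<Row> directly, so the RowEncoder workaround from the beginning of the question is no longer needed (the builder settings here are just an example):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("person-transform") // hypothetical app name
    .master("local[*]")          // assumption: running locally, as in my tests
    .getOrCreate();
Dataset<Row> dataset = spark.read().parquet("src/test/resources/integration/input/source.gz.parquet");
dataset.show();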
Now everything is working as expected.