I'm pretty new to Spark SQL. While implementing one of my training tasks I ran into the following issue and could not find an answer (the examples below are a bit contrived, but they should still be fine for demonstration purposes).
My app reads a parquet file and creates a dataset based on its content:
// read the parquet file into a DataFrame
DataFrame input = sqlContext.read().parquet("src/test/resources/integration/input/source.gz.parquet");
// wrap it as a Dataset<Row>, building a RowEncoder from the DataFrame's schema
Dataset<Row> dataset = input.as(RowEncoder$.MODULE$.apply(input.schema()));
The dataset.show() call results in:
+----------+------------+------+
|     Names|      Gender|   Age|
+----------+------------+------+
|Jack, Jill|Male, Female|30, 25|
+----------+------------+------+
Then I convert the dataset into a new dataset of Person objects:
public static Dataset<Person> transformToPerson(Dataset<Row> rawData) {
    return rawData
        .flatMap((Row sourceRow) -> {
            // code to parse an input row and split person data goes here
            Person person1 = new Person(name1, gender1, age1);
            Person person2 = new Person(name2, gender2, age2);
            return Arrays.asList(person1, person2);
        }, Encoders.bean(Person.class));
}
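For completeness, here is a sketch of what the parsing step might look like, assuming each of the three columns holds two comma-separated string values (the splitting logic is hypothetical and not part of my real code):

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

public static Dataset<Person> transformToPerson(Dataset<Row> rawData) {
    return rawData
        .flatMap((Row sourceRow) -> {
            // hypothetical parsing: each column contains two comma-separated values
            String[] names = sourceRow.getString(0).split(",\\s*");
            String[] genders = sourceRow.getString(1).split(",\\s*");
            String[] ages = sourceRow.getString(2).split(",\\s*");
            Person person1 = new Person(names[0], genders[0], ages[0]);
            Person person2 = new Person(names[1], genders[1], ages[1]);
            // the 1.6 Java flatMap expects an Iterable
            return Arrays.asList(person1, person2);
        }, Encoders.bean(Person.class));
}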
where
public abstract class Human implements Serializable {
    protected String name;
    protected String gender;
    // getters/setters go here
    // default constructor + constructor with the name and gender params
}

public class Person extends Human {
    private String age;
    // getters/setters for the age param go here
    // default constructor + constructor with the name, gender and age params
    // overridden toString() method which returns the string: (<name>, <gender>, <age>)
}
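Spelled out, the Person bean might look like this (a sketch following standard JavaBean conventions, which Encoders.bean(Person.class) relies on to derive the schema):

public class Person extends Human {
    private String age;

    // no-arg constructor, required for bean encoding
    public Person() {
    }

    public Person(String name, String gender, String age) {
        super(name, gender);
        this.age = age;
    }

    public String getAge() { return age; }
    public void setAge(String age) { this.age = age; }

    @Override
    public String toString() {
        return "(" + name + ", " + gender + ", " + age + ")";
    }
}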
Finally, when I show the dataset's content I expect to see
+----+------+---+
|name|gender|age|
+----+------+---+
|Jack|  Male| 30|
|Jill|Female| 25|
+----+------+---+
However, I see
+------------------+------+---+
|              name|gender|age|
+------------------+------+---+
|  (Jack, Male, 30)|      |   |
|(Jill, Female, 25)|      |   |
+------------------+------+---+
This is the result of the overridden toString() method, while the header is correct. I believe something is wrong with the Encoder, because if I use Encoders.javaSerialization(T) or Encoders.kryo(T) instead, it shows
+------------------+
|             value|
+------------------+
|  (Jack, Male, 30)|
|(Jill, Female, 25)|
+------------------+
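A single value column is at least consistent with those encoders, since they serialize the whole object into one binary field; printSchema() makes the difference visible (output abbreviated, and only a sketch of what I'd expect):

dataset.printSchema();
// with Encoders.bean(Person.class), one column per bean property:
// root
//  |-- age: string (nullable = true)
//  |-- gender: string (nullable = true)
//  |-- name: string (nullable = true)
//
// with Encoders.kryo(Person.class), a single binary column:
// root
//  |-- value: binary (nullable = true)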
What worries me most is that incorrect usage of encoders might lead to incorrect SerDe behavior and/or performance penalties, and I cannot see anything special in any of the Spark Java examples I can find. Could you please suggest what I am doing wrong?
UPDATE 1
Here are my project's dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
SOLUTION
As abaghel suggested, I upgraded the version to 2.0.2 (be aware that version 2.0.0 has a bug on Windows), used Dataset instead of DataFrame everywhere in my code (in the Java API, DataFrame was replaced by Dataset<Row> as of 2.0.0), and switched to the iterator-based flatMap function to transform from Row to Person.
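For reference, in Spark 2.x the Java FlatMapFunction returns an Iterator rather than an Iterable, and the lambda usually needs an explicit cast to FlatMapFunction to disambiguate it from the Scala overload. A sketch, reusing the same hypothetical parsing as above:

import java.util.Arrays;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

public static Dataset<Person> transformToPerson(Dataset<Row> rawData) {
    return rawData.flatMap((FlatMapFunction<Row, Person>) sourceRow -> {
        // same hypothetical comma-splitting as in the 1.6 sketch
        String[] names = sourceRow.getString(0).split(",\\s*");
        String[] genders = sourceRow.getString(1).split(",\\s*");
        String[] ages = sourceRow.getString(2).split(",\\s*");
        Person person1 = new Person(names[0], genders[0], ages[0]);
        Person person2 = new Person(names[1], genders[1], ages[1]);
        // 2.x expects an Iterator, not an Iterable
        return Arrays.asList(person1, person2).iterator();
    }, Encoders.bean(Person.class));
}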
Just to share: on version 1.6.2, by contrast, the TraversableOnce-based flatMap approach did not work for me, as it threw a 'MyPersonConversion$function1 not Serializable' exception.
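For completeness, in 2.x the read step goes through SparkSession and yields a Dataset<Row> directly, so the RowEncoder workaround from the beginning of the question is no longer needed (the builder settings here are just an example):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("person-transform") // hypothetical app name
    .master("local[*]")          // assumption: running locally, as in my tests
    .getOrCreate();
Dataset<Row> dataset = spark.read().parquet("src/test/resources/integration/input/source.gz.parquet");
dataset.show();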
Now everything is working as expected.