I am trying to convert a Spark Dataset to Java objects. The schema looks like this:
root
|-- deptId: long (nullable = true)
|-- depNameName: string (nullable = true)
|-- employee: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- firstName: string (nullable = true)
| | |-- lastName: string (nullable = true)
| | |-- phno: array (nullable = true)
| | | |-- element: integer (containsNull = true)
I created the POJO classes like this:
class Department {
    private Long deptId;
    private String depName;
    private List<Employee> employees;
    // with getters, setters, and a no-argument constructor
}
class Employee {
    private String firstName;
    private String lastName;
    private List<Long> phno;
    // with getters, setters, and a no-argument constructor
}
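For clarity, here is a sketch of the two beans with the accessors written out. The accessor names are assumptions that follow the usual JavaBean convention, with the list getter named getEmployee() as in the conversion code. The classes are non-public here only so that both fit in one file.

```java
import java.util.List;

// Sketch of the Employee bean: private fields, no-argument constructor,
// and one getter/setter pair per field (accessor names are assumed).
class Employee {
    private String firstName;
    private String lastName;
    private List<Long> phno;

    public Employee() { }

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }

    public List<Long> getPhno() { return phno; }
    public void setPhno(List<Long> phno) { this.phno = phno; }
}

// Sketch of the Department bean. The list accessors are named
// getEmployee()/setEmployee(), so the bean property is "employee".
class Department {
    private Long deptId;
    private String depName;
    private List<Employee> employees;

    public Department() { }

    public Long getDeptId() { return deptId; }
    public void setDeptId(Long deptId) { this.deptId = deptId; }

    public String getDepName() { return depName; }
    public void setDepName(String depName) { this.depName = depName; }

    public List<Employee> getEmployee() { return employees; }
    public void setEmployee(List<Employee> employees) { this.employees = employees; }
}
```

As far as I understand, a bean property name comes from its getter and setter rather than from the private field, so getEmployee() corresponds to a property called "employee".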
Here is the code I am using for the conversion:
// parquetFilePath is a placeholder for the path to the Parquet file
Dataset<Row> ds = this.spark.read().parquet(parquetFilePath);
Dataset<Department> departmentDataset =
        ds.as(Encoders.bean(Department.class));
JavaRDD<String> rdd =
        departmentDataset.toJavaRDD().map((Function<Department, String>) v -> {
            StringBuilder sb = new StringBuilder();
            sb.append("deptId").append(v.getDeptId());
            if (!CollectionUtil.isListNullOrEmpty(v.getEmployee())) {
                // Only the first employee of each department is reported
                Employee first = v.getEmployee().get(0);
                sb.append("FirstName").append(first.getFirstName());
                if (!CollectionUtil.isListNullOrEmpty(first.getPhno())) {
                    sb.append("Ph number").append(first.getPhno().get(0));
                }
            }
            return sb.toString();
        });
But this code does not work. It fails with org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException.
However, I can do the conversion with a Row-based constructor, where I have to hard-code the column names, like this:
public Department(Row row)
{
    this.employees = new ArrayList<>();
    this.deptId = (Long) row.getAs("deptId");
    // Each element of the "employee" array column is itself a Row
    List<Row> rowList = row.getList(row.fieldIndex("employee"));
    if (rowList != null) {
        for (Row r : rowList) {
            Employee obj = new Employee(r);
            employees.add(obj);
        }
    }
}
public Employee(Row row)
{
    this.phno = new ArrayList<>();
    this.firstName = (String) row.getAs("firstName");
    List<Long> rowList = row.getList(row.fieldIndex("phno"));
    if (rowList != null) {
        for (Long p : rowList) {
            phno.add(p);
        }
    }
}
JavaRDD<Department> departmentRdd = ds.toJavaRDD().map(Department::new);
JavaRDD<String> rdd = departmentRdd.map((Function<Department, String>) v -> {
    StringBuilder sb = new StringBuilder();
    sb.append("deptId").append(v.getDeptId());
    if (!CollectionUtil.isListNullOrEmpty(v.getEmployee())) {
        // Only the first employee of each department is reported
        Employee first = v.getEmployee().get(0);
        sb.append("FirstName").append(first.getFirstName());
        if (!CollectionUtil.isListNullOrEmpty(first.getPhno())) {
            sb.append("Ph number").append(first.getPhno().get(0));
        }
    }
    return sb.toString();
});
This approach works, but it involves a lot of hard-coding of column names from the schema, so I am looking for a more elegant solution.
Please suggest the best solution for this problem.
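For reference, the string that the map step is meant to build for each department can be sketched without Spark as follows. Emp, Dept, DepartmentFormatter, and isNullOrEmpty are hypothetical stand-ins (for the Employee and Department beans and for CollectionUtil.isListNullOrEmpty), introduced only to keep the sketch self-contained. The phone-number check is nested inside the employee check so that an empty employee list is never indexed.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the Employee bean (hypothetical name).
class Emp {
    final String firstName;
    final List<Long> phno;

    Emp(String firstName, List<Long> phno) {
        this.firstName = firstName;
        this.phno = phno;
    }
}

// Simplified stand-in for the Department bean (hypothetical name).
class Dept {
    final Long deptId;
    final List<Emp> employee;

    Dept(Long deptId, List<Emp> employee) {
        this.deptId = deptId;
        this.employee = employee;
    }
}

class DepartmentFormatter {
    // Stand-in for CollectionUtil.isListNullOrEmpty (assumed behaviour).
    static boolean isNullOrEmpty(List<?> list) {
        return list == null || list.isEmpty();
    }

    // Builds the same string as the lambda passed to map(...):
    // the department id, then the first employee's first name and
    // first phone number when they are present.
    static String describe(Dept d) {
        StringBuilder sb = new StringBuilder();
        sb.append("deptId").append(d.deptId);
        if (!isNullOrEmpty(d.employee)) {
            Emp first = d.employee.get(0);
            sb.append("FirstName").append(first.firstName);
            if (!isNullOrEmpty(first.phno)) {
                sb.append("Ph number").append(first.phno.get(0));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Dept withEmployee = new Dept(10L, Arrays.asList(new Emp("Ann", Arrays.asList(5551234L))));
        System.out.println(describe(withEmployee));       // deptId10FirstNameAnnPh number5551234
        System.out.println(describe(new Dept(11L, null))); // deptId11
    }
}
```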