
I am trying to convert a Dataset to a Java object. The schema looks like this:

root
 |-- deptId: long (nullable = true)
 |-- depName: string (nullable = true)
 |-- employee: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- phno: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)

I created the POJO classes like this:

class Department {
  private Long deptId;
  private String depName;
  private List<Employee> employees;
  // with getters, setters, and a no-argument constructor
}

class Employee {
  private String firstName;
  private String lastName;
  private List<Long> phno;
  // with getters, setters, and a no-argument constructor
}
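
`Encoders.bean` infers its schema from the JavaBean properties of the class, i.e. from the getter/setter pairs, so every property name has to line up with a column name in the Dataset. In the schema above the column is `employee` (singular) while the bean property would be `employees`, and a mismatch like that is one plausible reason the bean encoder fails to bind (the YARN `ContainerExecutionException` usually hides the real cause, which shows up in the driver/executor logs). As a quick sanity check, the property names the encoder would see can be listed with the stdlib `Introspector`, no Spark needed. The class names below (`BeanProps`, a local mirror of `Department`) are made up for illustration:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.List;

public class BeanProps {
    // Local stand-in mirroring the Department bean; the element type of the
    // list is irrelevant for the property-name check.
    public static class Department {
        private Long deptId;
        private String depName;
        private List<Object> employees;
        public Long getDeptId() { return deptId; }
        public void setDeptId(Long v) { deptId = v; }
        public String getDepName() { return depName; }
        public void setDepName(String v) { depName = v; }
        public List<Object> getEmployees() { return employees; }
        public void setEmployees(List<Object> v) { employees = v; }
    }

    public static void main(String[] args) throws IntrospectionException {
        // Print the property names a bean encoder would derive from the getters;
        // each must match a Dataset column name exactly.
        for (PropertyDescriptor pd :
                Introspector.getBeanInfo(Department.class, Object.class).getPropertyDescriptors()) {
            System.out.println(pd.getName()); // prints: depName, deptId, employees (alphabetical)
        }
    }
}
```

If a name here differs from a column name, renaming the column (or the property) before calling `ds.as(Encoders.bean(...))` is worth trying.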

Now here is the code I am using for the conversion:

  Dataset<Row> ds = this.spark.read().parquet(parquetFilePath); // path to the Parquet file
  Dataset<Department> departmentDataset = ds.as(Encoders.bean(Department.class));
  JavaRDD<String> rdd = departmentDataset.toJavaRDD().map((Function<Department, String>) v -> {
      StringBuilder sb = new StringBuilder();
      sb.append("deptId").append(v.getDeptId());
      if (!CollectionUtil.isListNullOrEmpty(v.getEmployees())) {
          Employee first = v.getEmployees().get(0);
          sb.append("FirstName").append(first.getFirstName());
          if (!CollectionUtil.isListNullOrEmpty(first.getPhno())) {
              sb.append("Ph number").append(first.getPhno().get(0));
          }
      }
      return sb.toString();
  });

But this code is not working. It fails with org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException. However, I can do the conversion with a Row-based constructor, where I have to hard-code the column names, like this:

 public Department(Row row) {
   this.employees = new ArrayList<Employee>();
   this.deptId = (Long) row.getAs("deptId");
   List<Row> rowList = row.getList(row.fieldIndex("employee"));
   if (rowList != null) {
     for (Row r : rowList) {
       employees.add(new Employee(r));
     }
   }
 }

 public Employee(Row row) {
   this.phno = new ArrayList<Long>();
   this.firstName = (String) row.getAs("firstName");
   List<Long> phnoList = row.getList(row.fieldIndex("phno"));
   if (phnoList != null) {
     phno.addAll(phnoList);
   }
 }

 JavaRDD<Department> departmentRdd = ds.toJavaRDD().map(Department::new);
 JavaRDD<String> rdd = departmentRdd.map((Function<Department, String>) v -> {
     StringBuilder sb = new StringBuilder();
     sb.append("deptId").append(v.getDeptId());
     if (!CollectionUtil.isListNullOrEmpty(v.getEmployees())) {
         Employee first = v.getEmployees().get(0);
         sb.append("FirstName").append(first.getFirstName());
         if (!CollectionUtil.isListNullOrEmpty(first.getPhno())) {
             sb.append("Ph number").append(first.getPhno().get(0));
         }
     }
     return sb.toString();
 });

This approach succeeds, but it involves a lot of hard-coding of schema names, so I am looking for a more elegant solution.
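
One way to cut down the hard-coding in the Row-based constructors is a small reflection helper that copies values into any bean by matching JavaBean property names against column names. The sketch below is illustrative only: `BeanPopulator` and `populateBean` are made-up names, and a plain `Map<String, Object>` stands in for Spark's `Row.getAs` lookup; with a real `Row` you would fetch each value via `row.getAs(pd.getName())` instead (nested structs and arrays would still need recursive handling).

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.HashMap;
import java.util.Map;

public class BeanPopulator {
    // Copy entries whose keys match bean property names into a new bean instance.
    public static <T> T populateBean(Class<T> type, Map<String, Object> values) throws Exception {
        T bean = type.getDeclaredConstructor().newInstance();
        for (PropertyDescriptor pd :
                Introspector.getBeanInfo(type, Object.class).getPropertyDescriptors()) {
            Object value = values.get(pd.getName());
            if (value != null && pd.getWriteMethod() != null) {
                pd.getWriteMethod().invoke(bean, value);
            }
        }
        return bean;
    }

    // Minimal bean used only to demonstrate the helper.
    public static class Dept {
        private Long deptId;
        private String depName;
        public Long getDeptId() { return deptId; }
        public void setDeptId(Long deptId) { this.deptId = deptId; }
        public String getDepName() { return depName; }
        public void setDepName(String depName) { this.depName = depName; }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> row = new HashMap<>();
        row.put("deptId", 42L);
        row.put("depName", "engineering");
        Dept d = populateBean(Dept.class, row);
        System.out.println(d.getDeptId() + " " + d.getDepName()); // prints: 42 engineering
    }
}
```

The same loop works for any bean class, so no constructor needs to name its columns one by one.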

Please suggest the best solution for this problem.

  • Have a look at the accepted answer here: https://stackoverflow.com/questions/28166555/how-to-convert-row-of-a-scala-dataframe-into-case-class-most-efficiently. I know the question is about Scala, but the accepted answer is actually in Java. – Glennie Helles Sindholt Nov 06 '18 at 09:56
  • Using the Encoders approach works only if there is no nested object, like the List here. @GlennieHellesSindholt – Aslan Nov 06 '18 at 13:32
  • Any reference to the solution you adopted? – maddie Jun 16 '20 at 00:19
