
I have a flat dataframe (df) with the structure below:

root
 |-- first_name: string (nullable = true)
 |-- middle_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- title: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_Date: string (nullable = true)
 |-- city: string (nullable = true)
 |-- zip_code: string (nullable = true)
 |-- state: string (nullable = true)
 |-- country: string (nullable = true)
 |-- email_name: string (nullable = true)
 |-- company: struct (nullable = true)
 |-- org_name: string (nullable = true)
 |-- company_phone: string (nullable = true)
 |-- partition_column: string (nullable = true)

I need to convert this dataframe into the following structure (my subsequent data will arrive in this format):

root
 |-- firstName: string (nullable = true)
 |-- middleName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- currentPosition: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- startDate: string (nullable = true)
 |    |    |-- endDate: string (nullable = true)
 |    |    |-- address: struct (nullable = true)
 |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- zipCode: string (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |-- emailName: string (nullable = true)
 |    |    |-- company: struct (nullable = true)
 |    |    |    |-- orgName: string (nullable = true)
 |    |    |    |-- companyPhone: string (nullable = true)
 |-- partitionColumn: string (nullable = true)

So far I have implemented this:

case class IndividualCompany(orgName: String,
                             companyPhone: String)

case class IndividualAddress(city: String,
                   zipCode: String,
                   state: String,
                   country: String)

case class IndividualPosition(title: String,
                              startDate: String,
                              endDate: String,
                              address: IndividualAddress,
                              emailName: String,
                              company: IndividualCompany)

case class Individual(firstName: String,
                     middleName: String,
                     lastName: String,
                     currentPosition: Seq[IndividualPosition],
                     partitionColumn: String)


val makeCompany = udf((orgName: String, companyPhone: String) => IndividualCompany(orgName, companyPhone))
val makeAddress = udf((city: String, zipCode: String, state: String, country: String) => IndividualAddress(city, zipCode, state, country))

val makePosition = udf((title: String, startDate: String, endDate: String, address: IndividualAddress, emailName: String, company: IndividualCompany) 
                    => List(IndividualPosition(title, startDate, endDate, address, emailName, company)))


val selectData = df.select(
      col("first_name").as("firstName"),
      col("middle_name).as("middleName"),
      col("last_name").as("lastName"),
      makePosition(col("job_title"),
        col("start_date"),
        col("end_Date"),
        makeAddress(col("city"),
          col("zip_code"),
          col("state"),
          col("country")),
        col("email_name"),
        makeCompany(col("org_name"),
          col("company_phone"))).as("currentPosition"),
      col("partition_column").as("partitionColumn")
    ).as[Individual]

selectData.printSchema()
selectData.show(10)

I can see that the expected schema is generated for selectData, but the last line, where I try to retrieve actual data, fails with an error saying a user defined function could not be executed:

 org.apache.spark.SparkException: Failed to execute user defined function(anonfun$4: (string, string, string, struct<city:string,zipCode:string,state:string,country:string>, string, struct<orgName:string,companyPhone:string>) => array<struct<title:string,startDate:string,endDate:string,address:struct<city:string,zipCode:string,state:string,country:string>,emailName:string,company:struct<orgName:string,companyPhone:string>>>)

Is there any better way to achieve this?

  • Possible duplicate of [Spark UDF for StructType / Row](https://stackoverflow.com/q/42931796/10465355) – 10465355 Dec 05 '18 at 20:38

2 Answers


The problem here is that a udf can't take IndividualAddress and IndividualCompany directly as input. These are represented as structs in Spark, and to use them in a udf the correct input type is Row. That means you need to change the declaration of makePosition to:

val makePosition = udf((title: String, 
                        startDate: String, 
                        endDate: String, 
                        address: Row, 
                        emailName: String, 
                        company: Row)

Inside the udf you now need to use, for example, address.getAs[String]("city") to access the case class fields, and to use the class as a whole you need to recreate it.
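
For reference, a completed version of that udf could look like the sketch below. It assumes the case classes from the question are in scope; the getAs field names mirror the case class fields:

import org.apache.spark.sql.Row

val makePosition = udf((title: String,
                        startDate: String,
                        endDate: String,
                        address: Row,
                        emailName: String,
                        company: Row) =>
  Seq(IndividualPosition(
    title,
    startDate,
    endDate,
    // rebuild the case classes from the Row fields by name
    IndividualAddress(
      address.getAs[String]("city"),
      address.getAs[String]("zipCode"),
      address.getAs[String]("state"),
      address.getAs[String]("country")),
    emailName,
    IndividualCompany(
      company.getAs[String]("orgName"),
      company.getAs[String]("companyPhone")))))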

The easier and better alternative would be to do everything in a single udf as follows:

val makePosition = udf((title: String, 
    startDate: String, 
    endDate: String, 
    city: String, 
    zipCode: String, 
    state: String, 
    country: String,
    emailName: String, 
    orgName: String, 
    companyPhone: String) => 
        Seq(
          IndividualPosition(
            title, 
            startDate, 
            endDate, 
            IndividualAddress(city, zipCode, state, country),
            emailName, 
            IndividualCompany(orgName, companyPhone)
          )
        )
)
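
The select from the question then passes the flat columns straight to this single udf. A sketch, using the column names from the question's schema:

val selectData = df.select(
  col("first_name").as("firstName"),
  col("middle_name").as("middleName"),
  col("last_name").as("lastName"),
  makePosition(col("title"), col("start_date"), col("end_Date"),
    col("city"), col("zip_code"), col("state"), col("country"),
    col("email_name"), col("org_name"), col("company_phone")).as("currentPosition"),
  col("partition_column").as("partitionColumn")
).as[Individual]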
  • Thanks for this solution. Doing everything in one `udf` doesn't allow you to pass more than 10 parameters, which is why I was going for nested `udf`s. – Harshad_Pardeshi Dec 06 '18 at 14:53
  • For the first solution, how do I pass the `Row` type to this udf from a `df.select()` method? – Harshad_Pardeshi Dec 06 '18 at 14:54
  • Got it. I just modified my makeAddress and makeCompany methods as `val makeCompany = udf((orgName: String, companyPhone: String) => {Row(orgName, companyPhone)}, companySchema)` – Harshad_Pardeshi Dec 06 '18 at 17:22
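
Spelled out, the Row-plus-schema variant from that last comment might look as follows. This is a sketch: companySchema comes from the comment, while addressSchema is a hypothetical counterpart, both mirroring the case classes above. The two-argument form udf(f, dataType) interprets the returned Row using the given schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// explicit schemas matching IndividualCompany and IndividualAddress
val companySchema = StructType(Seq(
  StructField("orgName", StringType),
  StructField("companyPhone", StringType)))

// hypothetical counterpart to the comment's companySchema
val addressSchema = StructType(Seq(
  StructField("city", StringType),
  StructField("zipCode", StringType),
  StructField("state", StringType),
  StructField("country", StringType)))

val makeCompany = udf((orgName: String, companyPhone: String) =>
  Row(orgName, companyPhone), companySchema)

val makeAddress = udf((city: String, zipCode: String, state: String, country: String) =>
  Row(city, zipCode, state, country), addressSchema)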

I had a similar requirement.
What I did was create a typed user-defined aggregation that produces a List of elements.

import org.apache.spark.sql.{Encoder, TypedColumn}
import org.apache.spark.sql.expressions.Aggregator
import scala.collection.mutable

object ListAggregator {
  private type Buffer[T] = mutable.ListBuffer[T]

  /** Returns a column that aggregates all elements of type T in a List. */
  def create[T](columnName: String)
               (implicit listEncoder: Encoder[List[T]], listBufferEncoder: Encoder[Buffer[T]]): TypedColumn[T, List[T]] =
    new Aggregator[T, Buffer[T], List[T]] {
      override def zero: Buffer[T] =
        mutable.ListBuffer.empty[T]

      override def reduce(buffer: Buffer[T], elem: T): Buffer[T] =
        buffer += elem

      override def merge(b1: Buffer[T], b2: Buffer[T]): Buffer[T] =
        if (b1.length >= b2.length) b1 ++= b2 else b2 ++= b1

      override def finish(reduction: Buffer[T]): List[T] =
        reduction.toList

      override def bufferEncoder: Encoder[Buffer[T]] =
        listBufferEncoder

      override def outputEncoder: Encoder[List[T]] =
        listEncoder
    }.toColumn.name(columnName)
}

Now you can use it like this.

import org.apache.spark.sql.SparkSession

val spark =
  SparkSession
    .builder
    .master("local[*]")
    .getOrCreate()

import spark.implicits._

final case class Flat(id: Int, name: String, age: Int)
final case class Grouped(age: Int, users: List[(Int, String)])

val data =
  List(
    (1, "Luis", 21),
    (2, "Miguel", 21),
    (3, "Sebastian", 16)
  ).toDF("id", "name", "age").as[Flat]

val grouped =
  data
    .groupByKey(flat => flat.age)
    .mapValues(flat => (flat.id, flat.name))
    .agg(ListAggregator.create(columnName = "users"))
    .map(tuple => Grouped(age = tuple._1, users = tuple._2))
// grouped: org.apache.spark.sql.Dataset[Grouped] = [age: int, users: array<struct<_1:int,_2:string>>]

grouped.show(truncate = false)
// +---+------------------------+
// |age|users                   |
// +---+------------------------+
// |16 |[[3, Sebastian]]        |
// |21 |[[1, Luis], [2, Miguel]]|
// +---+------------------------+