
I am doing some calculations at row level in Scala/Spark. I have a dataframe created using the JSON below:

{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc", "def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP", "timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}

You can create a dataframe from this JSON directly (a sketch follows after the schema). My schema looks like this:

root
 |-- available: boolean (nullable = true)
 |-- createTime: string (nullable = true)
 |-- dataValue: struct (nullable = true)
 |    |-- another_source: string (nullable = true)
 |    |-- another_source_array: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- first: string (nullable = true)
 |    |    |    |-- last: string (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- names_source: struct (nullable = true)
 |    |    |-- first_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- last_names_id: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |-- timestamp: string (nullable = true)
 |-- deleteTime: string (nullable = true)
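
For reference, here is a minimal sketch of how the dataframe above can be built from that JSON record (assuming Spark 2.2+, where spark.read.json accepts a Dataset[String]; jsonString is a hypothetical val holding the single-line record shown above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// jsonString holds the single-line JSON record shown above (hypothetical name)
val df = spark.read.json(Seq(jsonString).toDS)
df.printSchema() // prints the schema tree above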

I am reading all columns separately with readSchema and writing with writeSchema. Out of the two complex columns, I am able to process one but not the other.

Below is part of my read schema:

  .add("names_source", StructType(
      StructField("first_names", ArrayType.apply(StringType)) ::
        StructField("last_names_id", ArrayType.apply(DoubleType)) ::
        Nil
))
.add("another_source_array", ArrayType(StructType(
      StructField("first", StringType) ::
        StructField("last", StringType) ::
        Nil
)))

Here is part of my write schema:

.add("names_source", StructType.apply(Seq(
           StructField("first_names", StringType),
            StructField("last_names_id", DoubleType))
  ))
  .add("another_source_array", ArrayType(StructType.apply(Seq(
           StructField("first", StringType),
            StructField("last", StringType))
   )))
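
For comparison, a write-schema fragment that would match the final layout requested at the end of the question needs ArrayType wrappers and LongType for the bigint elements (a sketch in the same fragment style, not the original code):

.add("names_source", StructType(Seq(
    StructField("first_names", ArrayType(StringType)),
    StructField("last_names_id", ArrayType(LongType)))
))
.add("another_source_array", ArrayType(StructType(Seq(
    StructField("first", StringType),
    StructField("last", StringType))
)))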

For processing, I am using a method that indexes all columns. Below is my code for the function:

def myMapRedFunction(df: DataFrame, spark: SparkSession): DataFrame = {

  val columnIndex = dataIndexingSchema.fieldNames.zipWithIndex.toMap

  val myRDD = df.rdd
    .map(row => {
      Row(
        row.getAs[Boolean](columnIndex("available")),
        parseDate(row.getAs[String](columnIndex("create_time"))),
        ??I Need help here??
        row.getAs[String](columnIndex("another_source")),
        anotherSourceArrayFunction(row.getSeq[Row](columnIndex("another_source_array"))),
        row.getAs[String](columnIndex("location")),
        row.getAs[String](columnIndex("timestamp")),
        parseDate(row.getAs[String](columnIndex("delete_time")))
      )
    }).distinct

  spark.createDataFrame(myRDD, dataWriteSchema)
}
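
parseDate is not shown in the question; below is a minimal hypothetical stand-in, assuming it turns yyyy-MM-dd strings into java.sql.Date (the name and behavior are guesses from the usage above):

import java.sql.Date

// Hypothetical stand-in for the parseDate referenced above:
// converts a yyyy-MM-dd string to java.sql.Date, passing nulls through.
def parseDate(s: String): Date =
  if (s == null) null else Date.valueOf(s)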

The another_source_array column is processed by the anotherSourceArrayFunction method to make sure we get the schema per the requirements. I need a similar function for the names_source column. Below is the function I am using for the another_source_array column (a sketch of a names_source equivalent follows it).

def anotherSourceArrayFunction(data: Seq[Row]): Seq[Row] = {
  if (data == null) {
    data
  } else {
    data.map(r => {
      val first = r.getAs[String]("first").toUpperCase
      val last = r.getAs[String]("last")
      new GenericRowWithSchema(Array(first, last), StructType(
        StructField("first", StringType) ::
          StructField("last", StringType) ::
          Nil
      ))
    })
  }
}
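
A names_source handler in the same style might look like the sketch below (an assumption, not the asker's code: namesSourceFunction is a hypothetical name, and it presumes last_names_id is read as long, matching the printed schema):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

// Hypothetical counterpart to anotherSourceArrayFunction:
// rebuilds the names_source struct with the required nested-array schema.
def namesSourceFunction(r: Row): Row = {
  if (r == null) {
    r
  } else {
    val firstNames = r.getAs[Seq[String]]("first_names")
    val lastNamesId = r.getAs[Seq[Long]]("last_names_id")
    new GenericRowWithSchema(Array(firstNames, lastNamesId), StructType(
      StructField("first_names", ArrayType(StringType)) ::
        StructField("last_names_id", ArrayType(LongType)) ::
        Nil
    ))
  }
}

The ??I Need help here?? slot in myMapRedFunction would then become namesSourceFunction(row.getAs[Row](columnIndex("names_source"))).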

In short, I need something like this, where I get my names_source column as a struct:

names_source:struct<first_names:array<string>,last_names_id:array<bigint>>
another_source_array:array<struct<first:string,last:string>>

The above are the column schemas required in the end. I can get another_source_array properly and need help with names_source. I think my write schema for this column is correct, but I am not sure; in the end I need names_source:struct&lt;first_names:array&lt;string&gt;,last_names_id:array&lt;bigint&gt;&gt; as the column schema.

Note: I am able to get the another_source_array column without any problem. I kept that function here for better understanding.

ChikuMiku

1 Answer


From what I can see in the code you've tried, you are trying to flatten the struct dataValue column into separate columns.

If my assumption is correct then you don't have to go through such complexity. You can simply do the following:

val myRDD = df.rdd
  .map(row => {
    Row(
      row.getAs[Boolean]("available"),
      parseDate(row.getAs[String]("createTime")),
      row.getAs[Row]("dataValue").getAs[Row]("names_source"),
      row.getAs[Row]("dataValue").getAs[String]("another_source"),
      row.getAs[Row]("dataValue").getAs[Seq[Row]]("another_source_array"),
      row.getAs[Row]("dataValue").getAs[String]("location"),
      row.getAs[Row]("dataValue").getAs[String]("timestamp"),
      parseDate(row.getAs[String]("deleteTime"))
    )
  }).distinct

import org.apache.spark.sql.types._

val dataWriteSchema = StructType(Seq(
  StructField("available", BooleanType, true),
  StructField("createTime", DateType, true),
  StructField("names_source", StructType(Seq(
    StructField("first_names", ArrayType(StringType), true),
    StructField("last_names_id", ArrayType(LongType), true))), true),
  StructField("another_source", StringType, true),
  StructField("another_source_array", ArrayType(StructType(Seq(
    StructField("first", StringType),
    StructField("last", StringType)))), true),
  StructField("location", StringType, true),
  StructField("timestamp", StringType, true),
  StructField("deleteTime", DateType, true)
))

spark.createDataFrame(myRDD, dataWriteSchema).show(false)


Using * to flatten the struct column

You can simply use .* on a struct column to put the elements of the struct column into separate columns:

import org.apache.spark.sql.functions._
df.select(col("available"), col("createTime"), col("dataValue.*"), col("deleteTime")).show(false)

You will have to convert the string date columns to DateType yourself with this method.
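
One way to do that conversion (a sketch using to_date, assuming the dates follow the default yyyy-MM-dd format; the column order matches the output below):

import org.apache.spark.sql.functions._

df.select(
    col("available"),
    to_date(col("createTime")).as("createTime"),
    col("dataValue.*"),
    to_date(col("deleteTime")).as("deleteTime"))
  .show(false)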

In both cases you should get the following output:

+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|available|createTime|names_source                                   |another_source|another_source_array|location|timestamp |deleteTime|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|false    |2016-01-08|[WrappedArray(abc, def),WrappedArray(123, 456)]|TableSources  |[[1.1,ONE]]         |GMP     |2018-02-11|2016-01-08|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+

I hope the answer is helpful.

Ramesh Maharjan
  • thanks for explanations, tried to reproduce your solution, but only 2nd method works for me. I got some Spark Exception for the first. Anyway, 2nd method is quite useful for me, thanks again) – hooke Apr 02 '20 at 19:04