I am doing some calculations at the row level in Scala/Spark. I have a dataframe created from the JSON below-
{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc", "def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP", "timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}
My schema looks like below-
root
 |-- available: boolean (nullable = true)
 |-- createTime: string (nullable = true)
 |-- dataValue: struct (nullable = true)
 |    |-- another_source: string (nullable = true)
 |    |-- another_source_array: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- first: string (nullable = true)
 |    |    |    |-- last: string (nullable = true)
 |    |-- location: string (nullable = true)
 |    |-- names_source: struct (nullable = true)
 |    |    |-- first_names: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- last_names_id: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |-- timestamp: string (nullable = true)
 |-- deleteTime: string (nullable = true)
I am reading all columns separately with readSchema and writing them with writeSchema. Of the two complex columns, I am able to process one but not the other.
Below is a part of my read schema-
.add("names_source", StructType(
StructField("first_names", ArrayType.apply(StringType)) ::
StructField("last_names_id", ArrayType.apply(DoubleType)) ::
Nil
))
.add("another_source_array", ArrayType(StructType(
StructField("first", StringType) ::
StructField("last", StringType) ::
Nil
)))
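I apply it roughly like this (a sketch; readSchema is the StructType that these .add calls build, and inputPath is my own placeholder)-
val df = spark.read.schema(readSchema).json(inputPath)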
Here is a part of my write schema-
.add("names_source", StructType.apply(Seq(
StructField("first_names", StringType),
StructField("last_names_id", DoubleType))
))
.add("another_source_array", ArrayType(StructType.apply(Seq(
StructField("first", StringType),
StructField("last", StringType))
)))
In processing, I am using a method to index all columns. Below is my code for the function-
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

def myMapRedFunction(df: DataFrame, spark: SparkSession): DataFrame = {
  val columnIndex = dataIndexingSchema.fieldNames.zipWithIndex.toMap
  val myRDD = df.rdd
    .map(row => {
      Row(
        row.getAs[Boolean](columnIndex("available")),
        parseDate(row.getAs[String](columnIndex("create_time"))),
        ??I Need help here??
        row.getAs[String](columnIndex("another_source")),
        anotherSourceArrayFunction(row.getSeq[Row](columnIndex("another_source_array"))),
        row.getAs[String](columnIndex("location")),
        row.getAs[String](columnIndex("timestamp")),
        parseDate(row.getAs[String](columnIndex("delete_time")))
      )
    }).distinct
  spark.createDataFrame(myRDD, dataWriteSchema)
}
The another_source_array column is processed by the anotherSourceArrayFunction method to make sure we get the schema per the requirements. I need a similar function for the names_source column; my rough attempt is sketched after it. Below is the function that I am using for the another_source_array column.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def anotherSourceArrayFunction(data: Seq[Row]): Seq[Row] = {
  if (data == null) {
    data
  } else {
    data.map(r => {
      val first = r.getAs[String]("first").toUpperCase
      val last = r.getAs[String]("last")
      new GenericRowWithSchema(Array(first, last), StructType(
        StructField("first", StringType) ::
        StructField("last", StringType) ::
        Nil
      ))
    })
  }
}
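For names_source I imagine something roughly along these lines, but this is only an untested sketch (namesSourceFunction and the intermediate names are mine)-
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructField, StructType}

def namesSourceFunction(row: Row): Row = {
  if (row == null) {
    row
  } else {
    // Untested guess: pull both nested arrays out of the struct and rebuild
    // the row as struct<first_names:array<string>,last_names_id:array<bigint>>.
    val firstNames = row.getAs[Seq[String]]("first_names")
    val lastNamesId = row.getAs[Seq[Long]]("last_names_id")
    new GenericRowWithSchema(Array(firstNames, lastNamesId), StructType(
      StructField("first_names", ArrayType(StringType)) ::
      StructField("last_names_id", ArrayType(LongType)) ::
      Nil
    ))
  }
}
I would call it as namesSourceFunction(row.getAs[Row](columnIndex("names_source"))) in place of the placeholder above, but I am not sure this is the right approach.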
In short, I need something like this, where I get my names_source column structured as a struct:
names_source:struct<first_names:array<string>,last_names_id:array<bigint>>
another_source_array:array<struct<first:string,last:string>>
Above are the column schemas required in the final output. I am able to get another_source_array properly and need help with names_source. I think my write schema for this column is correct, but I am not sure. Ultimately I need names_source:struct<first_names:array<string>,last_names_id:array<bigint>> as the column schema.
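To check what I actually end up with, I print the column type like this (a sketch; result is just the dataframe returned by myMapRedFunction)-
val result = myMapRedFunction(df, spark)
// I expect: struct<first_names:array<string>,last_names_id:array<bigint>>
println(result.schema("names_source").dataType.simpleString)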
Note: I am able to get the another_source_array column correctly without any problem. I kept that function here for better understanding.