6

I am using the spark libraries in Scala. I have created a DataFrame using

val searchArr = Array(
  StructField("log",IntegerType,true),
  StructField("user", StructType(Array(
    StructField("date",StringType,true),
    StructField("ua",StringType,true),
    StructField("ui",LongType,true))),true),
  StructField("what",StructType(Array(
    StructField("q1",ArrayType(IntegerType, true),true),
    StructField("q2",ArrayType(IntegerType, true),true),
    StructField("sid",StringType,true),
    StructField("url",StringType,true))),true),
  StructField("where",StructType(Array(
    StructField("o1",IntegerType,true),
    StructField("o2",IntegerType,true))),true)
)

val searchSt = new StructType(searchArr)    

val searchData = sqlContext.jsonFile(searchPath, searchSt)

I would now like to explode the field what.q1, which should contain an array of integers, but the documentation is limited: http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#explode(java.lang.String,%20java.lang.String,%20scala.Function1,%20scala.reflect.api.TypeTags.TypeTag)

So far I have tried a few things, without much luck:

val searchSplit = searchData.explode("q1", "rb")(q1 => q1.getList[Int](0).toArray())

Any ideas/examples of how to use explode on an array?

Jaume Primer

2 Answers

0

Did you try a UDF on the field "what"? Something like this could be useful:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// A UDF receives a struct column as a Row (not as GenericRowWithSchema)
val explode = udf { (aRow: Row) =>
  aRow match {
    case null => ""
    case _    => aRow.getList(0).get(0).toString
  }
}

val newDF = df.withColumn("newColumn", explode(col("what")))

where:

  • getList(0) returns the "q1" field
  • get(0) returns the first element of "q1"

I'm not sure, but you could also try getAs[T](fieldName: String) instead of getList(index: Int), which looks the field up by name rather than by position.
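A hedged sketch of that variant (the UDF name firstQ1 and output column name are illustrative, and this assumes a Spark version where Row.getAs by field name is available):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Sketch: fetch "q1" from the "what" struct by name rather than position,
// and guard against null rows and empty arrays.
val firstQ1 = udf { (what: Row) =>
  Option(what)
    .map(_.getAs[Seq[Int]]("q1").headOption.map(_.toString).getOrElse(""))
    .getOrElse("")
}

val newDF = df.withColumn("newColumn", firstQ1(col("what")))
```

Looking fields up by name keeps the UDF working even if the struct's field order changes.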

pheeleeppoo
0

I'm not used to Scala, but in Python/pyspark an array-type column nested within a struct-type field can be exploded as follows. If it works for you, you can convert it to the corresponding Scala representation.

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, IntegerType, LongType, StringType, StructField, StructType

schema = StructType([
  StructField("log", IntegerType()),
  StructField("user", StructType([
    StructField("date", StringType()),
    StructField("ua", StringType()),
    StructField("ui", LongType())])),
  StructField("what", StructType([
    StructField("q1", ArrayType(IntegerType())),
    StructField("q2", ArrayType(IntegerType())),
    StructField("sid", StringType()),
    StructField("url", StringType())])),
  StructField("where", StructType([
    StructField("o1", IntegerType()),
    StructField("o2", IntegerType())]))
])

data = [((1), ("2022-01-01","ua",1), ([1,2,3],[6],"sid","url"), (7,8))]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)

Output:

+---+-------------------+--------------------------+------+
|log|user               |what                      |where |
+---+-------------------+--------------------------+------+
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|
+---+-------------------+--------------------------+------+

With what.q1 exploded:

df.withColumn("what.q1_exploded", explode(col("what.q1"))).show(truncate=False)

Output:

+---+-------------------+--------------------------+------+----------------+
|log|user               |what                      |where |what.q1_exploded|
+---+-------------------+--------------------------+------+----------------+
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|1               |
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|2               |
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|3               |
+---+-------------------+--------------------------+------+----------------+
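For completeness, a hedged Scala sketch of the same approach, assuming Spark 2.x+ where the explode function in org.apache.spark.sql.functions supersedes the deprecated DataFrame.explode method used in the question:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Nested array fields can be addressed with dot notation directly;
// no UDF is needed once the built-in explode function is available.
val exploded = searchData.withColumn("q1_exploded", explode(col("what.q1")))
```

Each element of what.q1 becomes its own row, with the other columns duplicated, matching the pyspark output above.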
Azhar Khan