6

I am using the spark libraries in Scala. I have created a DataFrame using

val searchArr = Array(
  StructField("log",IntegerType,true),
  StructField("user", StructType(Array(
    StructField("date",StringType,true),
    StructField("ua",StringType,true),
    StructField("ui",LongType,true))),true),
  StructField("what",StructType(Array(
    StructField("q1",ArrayType(IntegerType, true),true),
    StructField("q2",ArrayType(IntegerType, true),true),
    StructField("sid",StringType,true),
    StructField("url",StringType,true))),true),
  StructField("where",StructType(Array(
    StructField("o1",IntegerType,true),
    StructField("o2",IntegerType,true))),true)
)

val searchSt = new StructType(searchArr)    

val searchData = sqlContext.jsonFile(searchPath, searchSt)

I would now like to explode the field what.q1, which should contain an array of integers, but the documentation is limited: http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#explode(java.lang.String,%20java.lang.String,%20scala.Function1,%20scala.reflect.api.TypeTags.TypeTag)

So far I have tried a few things, without much luck:

val searchSplit = searchData.explode("q1", "rb")(q1 => q1.getList[Int](0).toArray())

Any ideas/examples of how to use explode on an array?

Jaume Primer

2 Answers

0

Did you try a UDF on the field "what"? Something like this could be useful:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// A UDF receives a struct column as a Row (not as GenericRowWithSchema)
val explode = udf { (aRow: Row) =>
  aRow match {
    case null => ""
    case _    => aRow.getList(0).get(0).toString
  }
}

val newDF = df.withColumn("newColumn", explode(col("what")))

where:

  • getList(0) returns the "q1" field
  • get(0) returns the first element of "q1"

I'm not sure, but you could also try getAs[T](fieldName: String) instead of getList(index: Int), which looks the field up by name rather than by position.
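A hedged sketch of that variant (the UDF name firstQ1 and output column name are illustrative, and this assumes a Spark version where Row.getAs by field name is available):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Sketch: fetch "q1" from the "what" struct by name rather than position,
// and guard against null rows and empty arrays.
val firstQ1 = udf { (what: Row) =>
  Option(what)
    .map(_.getAs[Seq[Int]]("q1").headOption.map(_.toString).getOrElse(""))
    .getOrElse("")
}

val newDF = df.withColumn("newColumn", firstQ1(col("what")))
```

Looking fields up by name keeps the UDF working even if the struct's field order changes.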

pheeleeppoo
0

I'm not used to Scala, but in Python/pyspark an array-type column nested within a struct-type field can be exploded as follows. If it works for you, you can convert it to the corresponding Scala representation.

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, IntegerType, LongType, StringType, StructField, StructType

schema = StructType([
  StructField("log", IntegerType()),
  StructField("user", StructType([
    StructField("date", StringType()),
    StructField("ua", StringType()),
    StructField("ui", LongType())])),
  StructField("what", StructType([
    StructField("q1", ArrayType(IntegerType())),
    StructField("q2", ArrayType(IntegerType())),
    StructField("sid", StringType()),
    StructField("url", StringType())])),
  StructField("where", StructType([
    StructField("o1", IntegerType()),
    StructField("o2", IntegerType())]))
])

data = [((1), ("2022-01-01","ua",1), ([1,2,3],[6],"sid","url"), (7,8))]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)

Output:

+---+-------------------+--------------------------+------+
|log|user               |what                      |where |
+---+-------------------+--------------------------+------+
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|
+---+-------------------+--------------------------+------+

With what.q1 exploded:

df.withColumn("what.q1_exploded", explode(col("what.q1"))).show(truncate=False)

Output:

+---+-------------------+--------------------------+------+----------------+
|log|user               |what                      |where |what.q1_exploded|
+---+-------------------+--------------------------+------+----------------+
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|1               |
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|2               |
|1  |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|3               |
+---+-------------------+--------------------------+------+----------------+
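For completeness, a hedged Scala sketch of the same approach, assuming Spark 2.x+ where the explode function in org.apache.spark.sql.functions supersedes the deprecated DataFrame.explode method used in the question:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Nested array fields can be addressed with dot notation directly;
// no UDF is needed once the built-in explode function is available.
val exploded = searchData.withColumn("q1_exploded", explode(col("what.q1")))
```

Each element of what.q1 becomes its own row, with the other columns duplicated, matching the pyspark output above.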
Azhar Khan