
I have a Spark DataFrame with the following schema and class info:

>ab
ab: org.apache.spark.sql.DataFrame = [block_number: bigint, collect_list(to): array<string> ... 1 more field]

>ab.printSchema
root
 |-- block_number: long (nullable = true)
 |-- collect_list(to): array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- collect_list(from): array (nullable = true)
 |    |-- element: string (containsNull = true)
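
For context, ab comes from a grouped aggregation along these lines (a sketch; "transactions" is just a placeholder name for my source DataFrame):

import org.apache.spark.sql.functions.collect_list

// "transactions" is a stand-in for the actual source DataFrame
val ab = transactions
  .groupBy("block_number")
  .agg(collect_list("to"), collect_list("from"))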

I want to simply merge the arrays from these two columns. I have tried to find a simple solution for this online but have not had any luck. Basically my issue comes down to two problems.

First, I suspect the solution involves the map function. I have not been able to find any syntax that actually compiles, so for now please accept my best attempt:

ab.rdd.map(
  row => {
    val block = row.getLong(0)
    val array1 = row(1).getAs[Array<string>]
    val array2 = row(2).getAs[Array<string>]
  }
)

Basically, issue number 1 is very simple, and it has been recurring since the day I first started using map in Scala: I can't figure out how to extract a field of an arbitrary type from a Row. I know that for primitive types there are things like row.getLong(0) etc., but I don't understand how this should be done for things like array types.

I have seen somewhere that something like row.getAs[Array<string>](1) should work, but when I try it I get the error

error: identifier expected but ']' found.
  val array1 = row.getAs[Array<string>](1)

As far as I can tell, this is exactly the syntax I have seen in other situations, but I can't tell why it's not working here. I think I have also seen syntax that looks like row(1).getAs[Type], but I am not sure.

The second issue is: once I can extract these two arrays, what is the best way of merging them? Using the intersect function? Or is there a better approach to this whole process? For example, using the brickhouse package?

Any help would be appreciated.

Best,

Paul

  • thank you for pointing that out! I tried googling every permutation I could think of on this question, and was surprised not to find any simple explanation. Alas, this is a perfect answer, but it doesn't show up in the top results if you search "extract array type column in spark dataframe"... however "access array type column in spark dataframe" shows it as the 5th result. ;/ – Paul Nov 23 '17 at 21:46

1 Answer


You don't need to switch to the RDD API; you can do it with a DataFrame UDF like this:

import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"..." column syntax

val mergeArrays = udf((arr1: Seq[String], arr2: Seq[String]) => arr1 ++ arr2)

df
  .withColumn("merged", mergeArrays($"collect_list(from)", $"collect_list(to)"))
  .show()

The above UDF just concatenates the two arrays (using the ++ operator); you could also use union, intersect, etc., depending on what you want to achieve.
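
For example, if you want set semantics instead of plain concatenation, a deduplicating variant of the same UDF could look like this (a sketch; .distinct drops duplicates after concatenating):

val mergeDistinct = udf((arr1: Seq[String], arr2: Seq[String]) => (arr1 ++ arr2).distinct)

df.withColumn("merged", mergeDistinct($"collect_list(from)", $"collect_list(to)"))

If you are on Spark 2.4 or later, the built-in functions concat (which works on array columns) and array_union (which also deduplicates) cover these cases without a UDF.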

As for your compile error: Scala generics use square brackets, not angle brackets, so the type is written Array[String] rather than Array<string>. Using the RDD API, the solution would look like this:

df.rdd.map(
     row => {
       val block = row.getLong(0)
       val array1 = row.getAs[Seq[String]](1)
       val array2 = row.getAs[Seq[String]](2)
       (block, array1 ++ array2)
     }
  ).toDF("block", "merged") // back to a DataFrame
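
A note on the type parameter: Spark returns array columns as a Scala Seq (concretely a WrappedArray), so getAs[Seq[String]] is the safe choice; getAs[Array[String]] compiles but typically fails at runtime with a ClassCastException.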

Raphael Roth