I have a Spark DataFrame with the following type and schema (output from the spark-shell):
>ab
ab: org.apache.spark.sql.DataFrame = [block_number: bigint, collect_list(to): array<string> ... 1 more field]
>ab.printSchema
root
 |-- block_number: long (nullable = true)
 |-- collect_list(to): array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- collect_list(from): array (nullable = true)
 |    |-- element: string (containsNull = true)
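For reference, ab is produced by something along these lines. The table and the data values are simplified, hypothetical stand-ins for my real data, but they reproduce the schema above:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._  // already in scope in the spark-shell

// hypothetical per-transaction input; only the shape matters here
val tx = Seq(
  (1L, "0xaaa", "0xbbb"),
  (1L, "0xccc", "0xddd"),
  (2L, "0xeee", "0xfff")
).toDF("block_number", "to", "from")

val ab = tx.groupBy("block_number")
  .agg(collect_list("to"), collect_list("from"))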
I want to merge the arrays from these two columns into a single array per row. I have tried to find a simple solution for this online but have not had any luck. My issue comes down to two problems.
First, I suspect the solution involves the map function. I have not been able to find any syntax that actually compiles, so for now please accept my best attempt:
ab.rdd.map(
  row => {
    val block  = row.getLong(0)
    val array1 = row(1).getAs[Array<string>]
    val array2 = row(2).getAs[Array<string>]
  }
)
Issue number 1 is very simple, and it has been recurring since the day I first started using map in Scala: I can't figure out how to extract a field of an arbitrary type from a Row. I know that for primitive types there are methods like row.getLong(0) and so on, but I don't understand how this should be done for things like array types.
I have seen somewhere that something like row.getAs[Array<string>](1) should work, but when I try it I get the error

error: identifier expected but ']' found.
       val array1 = row.getAs[Array<string>](1)
As far as I can tell, this is exactly the syntax I have seen in other situations, but I can't tell why it's not working here. I think I have also seen a syntax along the lines of row(1).getAs[Type], but I am not sure.
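Based on snippets from other answers, I suspect something like the sketch below is closer (square brackets instead of angle brackets, and getSeq instead of getAs), but I have not been able to verify that Seq[String] is the right element type or that this is the idiomatic approach:

val pairs = ab.rdd.map { row =>
  val block  = row.getLong(0)
  // the array columns seem to come back as Seq[String] rather than Array[String]
  val array1 = row.getSeq[String](1)   // or row.getAs[Seq[String]](1)
  val array2 = row.getSeq[String](2)
  (block, array1 ++ array2)
}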
The second issue is: once I can extract these two arrays, what is the best way to merge them? Using the intersect function? Or is there a better approach to this whole process, for example using the Brickhouse package?
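For completeness, I also wondered whether I could avoid the RDD entirely and stay at the DataFrame level with the built-in concat function. I believe concat accepts array columns in Spark 2.4+, but I am not sure about my version or whether this is the recommended way:

import org.apache.spark.sql.functions.{col, concat}

// keep the key column and concatenate the two array columns into one
val merged = ab.select(
  col("block_number"),
  concat(col("collect_list(to)"), col("collect_list(from)")).as("addresses")
)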
Any help would be appreciated.
Best,
Paul