
I want to convert a DataFrame created like this:

case class Student(name: String, age: Int)
// `sql` is an existing SQLContext
val dataFrame: DataFrame = sql.createDataFrame(
  sql.sparkContext.parallelize(List(Student("Torcuato", 27), Student("Rosalinda", 34))))

When I collect the results from the DataFrame, I get an Array[org.apache.spark.sql.Row] = Array([Torcuato,27], [Rosalinda,34]).

I'm looking into converting the DataFrame into an RDD[Map], e.g.:

Map("name" -> nameOFFirst, "age" -> ageOfFirst)
Map("name" -> nameOFsecond, "age" -> ageOfsecond)

I tried to use map via x._1, but that does not seem to work for Array[spark.sql.Row]. How can I perform this transformation?

  • The outer map doesn't have a key-value structure? – Himaprasoon Apr 14 '16 at 09:16
  • The context is I want to use spark-jobserver but have some problems regarding serialization of job results. Apparently only a map of string keys/values works. The result returned will be an aggregation of several Spark queries, so the outer map would kind of contain further keys. https://groups.google.com/forum/#!topic/spark-jobserver/V4finry_RoM – Georg Heiler Apr 14 '16 at 09:19
  • This is a very bad question, with a misleading title, bad practice, and a low-quality description. You'll need to work on these things when you post questions here. – eliasah Apr 14 '16 at 09:44

2 Answers


You can use the map function with pattern matching to do the job here:

import org.apache.spark.sql.Row

dataFrame
  .map { case Row(name, age) => Map("name" -> name, "age" -> age) }

This will result in an RDD[Map[String, Any]].
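
For illustration, a quick usage sketch (assuming Spark 1.x, where DataFrame.map returns an RDD, and the two-row dataFrame from the question; the val name rddOfMaps is just for the example):

import org.apache.spark.sql.Row

// Convert each Row to a Map, then collect and print (small data only)
val rddOfMaps = dataFrame.map { case Row(name, age) => Map("name" -> name, "age" -> age) }
rddOfMaps.collect().foreach(println)
// Expected output:
// Map(name -> Torcuato, age -> 27)
// Map(name -> Rosalinda, age -> 34)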

iboss

In other words, you can transform each row of the DataFrame into a map using getValuesMap; the following works:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def dfToMapOfRdd(df: DataFrame): RDD[Map[String, Any]] = {
    // Build a Map of column name -> value for every Row, keyed by the schema's field names
    val result: RDD[Map[String, Any]] = df.rdd.map(row => {
        row.getValuesMap[Any](row.schema.fieldNames)
    })
    result
}

refs: https://stackoverflow.com/a/46156025/6494418
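
A brief usage sketch with the dataFrame from the question (the val name maps is just for illustration):

// Hypothetical usage of the helper above
val maps = dfToMapOfRdd(dataFrame)
maps.collect().foreach(println)
// Expected output:
// Map(name -> Torcuato, age -> 27)
// Map(name -> Rosalinda, age -> 34)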

Colin Wang