
I have multiple Map[String, String] in a List (Scala). For example:

map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
map2 = Map("EMP_NAME" -> "Rahul", "DOB" -> "06-12-1991", "CITY" -> "Mumbai")
map3 = Map("EMP_NAME" -> "John", "DOB" -> "11-04-1996", "CITY" -> "Toronto")
list = List(map1, map2, map3)

Now I want to create a single dataframe with something like this:

EMP_NAME    DOB             CITY
Ahmad       01-10-1991      Dubai
Rahul       06-12-1991      Mumbai
John        11-04-1996      Toronto

How do I achieve this?

SAIYED

3 Answers


You can do it like this:

import spark.implicits._

val df = list
  .map(m => (m.get("EMP_NAME"), m.get("DOB"), m.get("CITY")))
  .toDF("EMP_NAME", "DOB", "CITY")

df.show()

+--------+----------+-------+
|EMP_NAME|       DOB|   CITY|
+--------+----------+-------+
|   Ahmad|01-10-1991|  Dubai|
|   Rahul|06-12-1991| Mumbai|
|    John|11-04-1996|Toronto|
+--------+----------+-------+
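
A small aside (my addition, not part of the original answer): m.get returns an Option[String], so the columns above come out as nullable strings and a missing key would simply surface as null. A minimal sketch making that fallback explicit, assuming the same list and spark session (df2 is just an illustrative name):

import spark.implicits._

// Sketch only: fall back to null explicitly when a key is absent
val df2 = list
  .map(m => (m.getOrElse("EMP_NAME", null),
             m.getOrElse("DOB", null),
             m.getOrElse("CITY", null)))
  .toDF("EMP_NAME", "DOB", "CITY")

df2.show()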
Raphael Roth

A slightly more generic approach, e.g.:

val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
val map2 = Map("EMP_NAME" -> "John",  "DOB" -> "01-10-1992", "CITY" -> "Mumbai")
///...
val list = List(map1, map2) // map3, ...
val RDDmap = sc.parallelize(list)

// Get cols dynamically
val cols = RDDmap.take(1).flatMap(x => x.keys)

// Each RDD element is one Map; take its values as a row
val df = RDDmap.map { value =>
           val list = value.values.toList
           (list(0), list(1), list(2))
         }.toDF(cols: _*) // dynamic column names assigned

df.show(false)

returns:

+--------+----------+------+
|EMP_NAME|DOB       |CITY  |
+--------+----------+------+
|Ahmad   |01-10-1991|Dubai |
|John    |01-10-1992|Mumbai|
+--------+----------+------+

Or, to answer your sub-question, something like the following - at least I think this is what you are asking:

val RDDmap = sc.parallelize(List(
   Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai"),
   Map("EMP_NAME" -> "John",  "DOB" -> "01-10-1992", "CITY" -> "Mumbai")))
   ...

// Get cols dynamically
val cols = RDDmap.take(1).flatMap(x => x.keys)

// Each RDD element is one Map; take its values as a row
val df = RDDmap.map { value =>
           val list = value.values.toList
           (list(0), list(1), list(2))
         }.toDF(cols: _*) // dynamic column names assigned

You can of course build the list dynamically, but you still need to assign the Map elements. See "Appending Data to List or any other collection Dynamically in scala". I would just read in from a file and be done with it.
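
For a fully dynamic version (my addition, assuming every map carries the same keys, a SparkSession named spark as in the first answer, and the RDDmap defined above; colNames and dfDyn are illustrative names), you can build Rows by looking each value up by key against a schema derived from the keys, so neither the number of columns nor their order is hardcoded and the iteration-order concern raised in the comments below does not apply:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Fix the column order once, then look each value up by key
val colNames = RDDmap.take(1).flatMap(_.keys)
val schema = StructType(colNames.map(c => StructField(c, StringType, nullable = true)))
val rowRDD = RDDmap.map(m => Row(colNames.map(c => m.getOrElse(c, null)): _*))
val dfDyn = spark.createDataFrame(rowRDD, schema)

dfDyn.show(false)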

thebluephantom
  • Thanks man. One more point: how do I loop (list(0), list(1), list(2)) dynamically? I mean, instead of hardcoding the positions 0, 1 and 2, can it take something like list(i)? – SAIYED Feb 18 '19 at 07:09
  • @thebluephantom, I wouldn't assume results from `.keys` and `.values` of a `Map` will always preserve the KV-pair-wise order. – Leo C Feb 18 '19 at 17:43
  • @LeoC Please elaborate – thebluephantom Feb 18 '19 at 17:58
  • If `m = Map(1->a, 2->b, ...)`, I think it's not safe to assume `m.keys` and `m.values` will for sure have their elements ordered like `1, 2, ...` and `a, b, ...`, respectively, as neither `Map` nor `Set` preserves order. – Leo C Feb 18 '19 at 18:24
  • @LeoC Do you mean that this, which is well understood to be an issue for rows, can apply to cols here during the flatMap? – thebluephantom Feb 18 '19 at 18:28
  • @LeoC I looked at https://www.tutorialspoint.com/scala/scala_maps.htm and they have a similar example, no mention of your point there. I remember studying this some time ago. – thebluephantom Feb 18 '19 at 18:54
  • @LeoC Just for posterity, not convinced the example here is classic Map use. – thebluephantom Feb 19 '19 at 22:32
  • @thebluephantom, my point is that since elements in `m.keys` and `m.values` cannot be guaranteed to preserve the ordering in the original KV-pairs, it's possible the result dataset could be something like `Seq(("a", "c", "b", "d")).toDF("2", "1", "3", "4")`. – Leo C Feb 19 '19 at 23:07
  • @LeoC How could that occur? That means tutorialspoint has a wrong example. What about getting cols from dataframes then? – thebluephantom Feb 19 '19 at 23:11
  • Because `Map` does not preserve order. e.g. `Map(1->"a", 2->"b") == Map(2->"b", 1->"a")`, but `Map(1->"a", 2->"b").keys.toList != Map(2->"b", 1->"a").keys.toList`. – Leo C Feb 19 '19 at 23:28
  • Got it. So it means two things: the tutorialspoint page I looked at in the past is wrong, and, per my previous point, this is a bad example of a Map use case anyway. (A short illustration follows below.) – thebluephantom Feb 19 '19 at 23:31
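
To make the ordering point concrete, a small plain-Scala illustration (my addition): the default immutable Map keeps insertion order only while it has at most four entries; with more entries it switches to a hash-based implementation whose iteration order is unrelated to insertion order. Within a single map, .keys and .values do iterate consistently with each other; the risk is using one map's key order to label another map's values.

val small = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
small.keys.toList   // List(EMP_NAME, DOB, CITY) - insertion order kept (<= 4 entries)

val big = (1 to 10).map(i => s"k$i" -> i).toMap
big.keys.toList     // hash-based iteration order here, not insertion order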
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DataFrameTest2 extends Serializable {
  var sparkSession: SparkSession = _
  var sparkContext: SparkContext = _

  def main(args: Array[String]): Unit = {
    sparkSession = SparkSession.builder().appName("TestMaster").master("local").getOrCreate()
    sparkContext = sparkSession.sparkContext

    val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
    val map2 = Map("EMP_NAME" -> "Rahul", "DOB" -> "06-12-1991", "CITY" -> "Mumbai")
    val map3 = Map("EMP_NAME" -> "John", "DOB" -> "11-04-1996", "CITY" -> "Toronto")
    val list = List(map1, map2, map3)

    //create your rows (each map's values follow its key iteration order,
    //which matches the header below since these small maps are built with the same key order)
    val rows = list.map(m => Row(m.values.toSeq: _*))

    //create the schema from the header
    val header = list.head.keys.toList
    val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))

    //create your rdd
    val rdd = sparkContext.parallelize(rows)

    //create your dataframe using rdd
    val df = sparkSession.createDataFrame(rdd, schema)
    df.show()
  }
}
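
As a quick sanity check (my addition, standard Spark API only): printing the schema right after df.show() should confirm that every column was created as a nullable string, since the schema above hardcodes StringType, with the column order following the key iteration order of the first map:

df.printSchema()
// root
//  |-- EMP_NAME: string (nullable = true)
//  |-- DOB: string (nullable = true)
//  |-- CITY: string (nullable = true)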
SAIYED
  • The protocol is that you select one of the other answers as being correct, unless no one else supplied one or you feel they were inappropriate. – thebluephantom Feb 18 '19 at 11:35
  • I think all the answers are correct in this context, yours and the first as well. Not sure how to mark multiple correct answers. Also, I was just looking for the most generic solution; in reality I will be creating and populating a dataset with some 40+ columns dynamically. By the way, I really appreciate the solution you provided :) – SAIYED Feb 18 '19 at 11:52
  • I upvoted your answer multiple times, but this is the message I get: "Thanks for the feedback! Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score." :( Looks like I need to build my reputation first :) – SAIYED Feb 18 '19 at 11:59
  • Not sure how to choose. I do not see any label/button to accept an answer. Is there any link? – SAIYED Feb 18 '19 at 12:05
  • Just click on the grey tick and it will turn green. You can choose only 1. – thebluephantom Feb 18 '19 at 12:07