  • Question 1: Can I use a tuple as the key of a map in Scala?
  • Question 2: If yes, how can I create a map with a tuple as the key?
  • Question 3: I want to convert my Scala map to an RDD; how would I do it in the following case? I am trying to do it this way:

    var mapRDD = sc.parallelize(map.toList)
    

    Is this the right way to do it?

  • Question 4: For this particular code snippet, when I do a println on the map, it has no values.

I have not included the whole code; basically, mapAgainstValue contains a userId as the key and a list of friends as the values. I want to recreate a map RDD with the following transformation of the key. What would be the reason for the empty map?

var mapAgainstValue = logData.map(x => x.split("\t")).filter(x => x.length == 2).map(x => (x(0), x(1).split(",")))
var map: Map[String, List[String]] = Map()
var changedMap = mapAgainstValue.map { line =>
  var key = ""
  for (userIds <- line._2) {
    if (line._1.toInt < userIds.toInt) {
      key = line._1.concat("-" + userIds)
    } else {
      key = userIds.concat("-" + line._1)
    }
    map += (key -> line._2.toList)
  }
}
changedMap.collect()
map.foreach(println)
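For comparison, here is a minimal sketch of the same key construction written as a pure RDD transformation (assuming mapAgainstValue as defined above); it emits the key/friends pairs from the RDD and collects them on the driver, instead of appending to a shared map from inside a closure:

// Sketch only: pairs and collectedMap are hypothetical names;
// the "low-high" key logic is the same as in the snippet above.
val pairs = mapAgainstValue.flatMap { case (user, friends) =>
  friends.map { friend =>
    val key =
      if (user.toInt < friend.toInt) user + "-" + friend
      else friend + "-" + user
    key -> friends.toList
  }
}
val collectedMap = pairs.collectAsMap()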

1 Answer


Yes, you can use a tuple as a key in a Map.

For example:

val userMap = Map(
    (1, 25) -> "shankar",
    (2, 35) -> "ramesh")
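For instance, you can then look up a value by its tuple key (a quick illustration using the userMap above):

userMap((1, 25))      // "shankar"
userMap.get((2, 35))  // Some("ramesh")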

Then you can print the output using foreach:

val userMapRDD = sparkContext.parallelize(userMap.toSeq, 2)
userMapRDD.foreach(element => {
  println(element)
})
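As the comments below point out, on a real cluster foreach(println) prints on the executor nodes rather than on the driver, so to see the output locally you would typically collect (or take) first:

userMapRDD.collect().foreach(println)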

If you want to transform the userMapRDD into something else, the following code returns only the age and name as a tuple:

val mappedRDD = userMapRDD.map {
  case ((empId: Int, age: Int), name: String) => {
    (age, name)
  }
}
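For example, collecting the transformed RDD on the driver (the ordering across partitions is not guaranteed):

mappedRDD.collect().foreach(println)
// (25,shankar)
// (35,ramesh)   (possibly in a different order)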
  • Or even `Map(1 -> 25 -> "shankar", 2 -> 35 -> "ramesh")` – Yawar Oct 22 '16 at 05:10
  • @Yawar: I like the way you created the map, +1 – Shankar Oct 22 '16 at 05:14
  • I believe you should use `.collect.foreach(println)` to print RDD: http://spark.apache.org/docs/latest/programming-guide.html#printing-elements-of-an-rdd – dk14 Oct 22 '16 at 05:40
  • @dk14: `collect` basically returns an array of tuples, and then you are doing `foreach` on the array elements. Both are the same. – Shankar Oct 22 '16 at 05:49
  • @dk14: also, `collect` gets all the output to the driver node, so if the collection is too large you might face an OutOfMemoryError; it's better not to use `collect` when you deal with huge data. – Shankar Oct 22 '16 at 05:50
  • 1
  • You can use `take` to reduce the amount of data, see [this answer](http://stackoverflow.com/a/23270977/1809978). Anyway, it's a way recommended by Apache's own(!!) documentation – dk14 Oct 22 '16 at 05:54
  • @dk14's point is important when running on an actual cluster and not in local mode: if you use `rdd.foreach(println)`, the printing will happen on the different nodes, so you won't see it on the driver side. When using a toy example with Spark's local mode none of this really matters. – Tzach Zohar Oct 22 '16 at 08:05
  • @TzachZohar The question doesn't specify how many nodes the OP ran on (but the OP is actually using `collect` - so it matters for him); besides, even in local mode you can specify, say, `local[4]` when creating the `SparkContext` - so, in general, it could at least scramble the output – dk14 Oct 22 '16 at 08:58
  • @dk14 I wasn't saying your comment isn't relevant - just wanted to clarify to exact conditions under which it is. Also, when using `local[4]` Spark will still use a _single_ process (the 4 "executors" will just be using different threads within same JVM) so all your data must fit into driver memory whether you use `collect` or not. Bottom line - if this is just a toy example using local mode - it doesn't matter; Otherwise - it matters a lot, as you pointed out. – Tzach Zohar Oct 22 '16 at 09:03
  • 1
  • @TzachZohar Yes, `local[4]` will use a single process, and yes, you need to reduce the amount of data printed (the second point is precisely described in the Spark docs at the link I provided). And even `local[4]` behavior is different from `local[1]` (precisely in ordering: as you've pointed out, operations are going to run in parallel). What I meant is that we mostly don't know whether it's real or not, so it's better to provide an answer that works in the worst case. – dk14 Oct 22 '16 at 09:17
  • @dk14 Using take() worked for me; at least I got the values in my map. – miniQ Oct 22 '16 at 13:34
  • @dk14 I used local[2] in my config. – miniQ Oct 22 '16 at 13:48