
I have been assigned the task of reading from a CSV and creating a ListMap variable. The reason for using this specific class is that, for some other use cases, a number of methods already take a ListMap as an input parameter, and they want one more.

What I have done so far is read from the CSV and create an RDD. The format of the CSV is

"field1,field2"
"value1,value2"
"value3,value4"

In this RDD I have tuples of strings. What I would like now is to convert this to a ListMap. So what I have is a variable of type Array[(String, String)] containing (value1,value2),(value3,value4).

I did this because I find it easy to go from a CSV to tuples. The problem is that I cannot find any way to go from here to a ListMap. It seems easier to get a normal Map, but as I said, the final result is required to be a ListMap type of object.

I have been reading, but I do not really understand this answer or this one.
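For context, here is roughly how I get that array (the file name and the header filtering below are just illustrative placeholders; sc is the usual SparkContext):

    // Rough sketch of what I have: read the CSV into an Array[(String, String)].
    // "data.csv" and the header filter are illustrative, not the real job.
    val pairs: Array[(String, String)] = sc.textFile("data.csv")
      .map(_.replaceAll("\"", ""))                 // strip the surrounding quotes
      .filter(line => !line.startsWith("field1"))  // skip the header row
      .map(_.split(","))
      .map(a => (a(0), a(1)))
      .collect()   // Array((value1,value2), (value3,value4))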


2 Answers


Array("foo" -> "bar", "baz" -> "bat").toMap gives you a Map. If you are looking for a ListMap specifically (for the life of me, can't think of a reason why you would), then you need a breakOut:

    import scala.collection.immutable.ListMap

    val map: ListMap[String, String] =
      Array("foo" -> "bar", "baz" -> "bat")
        .toMap
        .map(identity)(scala.collection.breakOut)

breakOut is sort of a "collection factory" that lets you implicitly convert between different collection types. You can read more about it here: https://docs.scala-lang.org/tutorials/FAQ/breakout.html
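Two caveats worth adding to that answer: breakOut was removed in Scala 2.13, and the .toMap step above goes through an unordered Map, so the pairs' original order is not guaranteed to survive into the ListMap. If preserving insertion order is the whole point of using ListMap, a version-safe sketch that builds it directly from the pairs:

    import scala.collection.immutable.ListMap

    // Build the ListMap straight from the pairs, keeping their original order.
    val ordered: ListMap[String, String] =
      ListMap(Array("foo" -> "bar", "baz" -> "bat"): _*)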

  • Ooooh, so that is what breakOut is. Btw, I do not know either. I hope to have time to have it explained to me one of these days... More importantly, I wanted to say I sadly wrote the question wrong. When I read a CSV, what I get is RDD[(String,String)]. My idea at this point is to collect said RDD to get an array, and then do what you have taught me. But I have this feeling that aaaaaaaaall of this is doing it wrong (on my part). – monkey intern Nov 10 '17 at 11:55
  • It's not "wrong" (except, maybe, for the part of using Spark for this in the first place), as long as it fits into memory. And if you want to end up with a single `Map` in the end, it had better do :) – Dima Nov 10 '17 at 15:43
  • So Spark is the wrong tool for this, right? I was just talking about this the other day. Hive or Impala would make more sense, I assume, since it's a file we could store straight in HDFS. What other choices would be recommended for a problem like this? And yeah, thanks for the warning about the memory and the map issue. I did look into it previously; it seems to be a small amount of data for now, because I was very worried about doing collect() – monkey intern Nov 13 '17 at 07:35
  • You don't need Hive (or even HDFS) either. Just drop a file on disk and read it directly. – Dima Nov 13 '17 at 13:05
  • But to read it you would need some tool, right? In this case, if the development is being done with Spark and Scala, does it make sense to use them, or is there an easier way to incorporate this into the workflow that I am not aware of (which is the more likely possibility haha)? – monkey intern Nov 13 '17 at 13:49
  • `Source.fromFile("filename.csv").getLines` – Dima Nov 13 '17 at 14:21
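Expanding on that last comment, a minimal sketch of the Spark-free route (the file name, the quote stripping, and the header handling are assumptions based on the data shown in the question):

    import scala.collection.immutable.ListMap
    import scala.io.Source

    // Read a small CSV with the standard library alone; no Spark required.
    val source = Source.fromFile("filename.csv")
    val listMap: ListMap[String, String] =
      try {
        val pairs = source.getLines()
          .drop(1)                        // skip the header row
          .map(_.replaceAll("\"", ""))    // strip the surrounding quotes
          .map(_.split(","))
          .map(a => a(0) -> a(1))
          .toList                         // force the iterator before closing
        ListMap(pairs: _*)
      } finally source.close()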

Given the sample data you provided, you can use the collectAsMap API to collect the pairs into a Map, and then build the final ListMap from it:

    // collectAsMap returns a scala.collection.Map, not yet a ListMap
    val collected = sparkSession.sparkContext.textFile("path to the text file")
      .map(line => line.split(","))
      .map(array => array(0) -> array(1))
      .collectAsMap()

That's it.

Now, if you want to go a step further, you can do an additional step:

    import scala.collection.immutable.ListMap

    var listMap: ListMap[String, String] = ListMap.empty[String, String]
    for (entry <- collected) {
      listMap += entry
    }
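As a side note (not part of the original answer), the loop can be collapsed into a single expression; keep in mind that collectAsMap makes no ordering guarantee, so the resulting ListMap's order is arbitrary either way:

    // Same result without mutation: feed all collected pairs to ListMap at once.
    val listMap: ListMap[String, String] = ListMap(collected.toSeq: _*)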
  • I ended up using both answers, and they both were incredibly helpful. Any advice on how to reflect that? I guess there is a meta post talking about this issue; there always is haha – monkey intern Nov 13 '17 at 07:36