Why does nested flatMap - map in Scala give an RDD of type Object instead of a list of tuples?

Question

I have an RDD that I want to group by some key, but it just doesn't work. I am a Scala and Spark beginner. I have the following RDD:


rdd: RDD[WikipediaArticle]

val meinVal = rdd.flatMap(article => langs.map(lang => if (article.mentionsLanguage(lang)) Tuple2(lang, article) else None)).filter(_ != None)

meinVal.collect.foreach(println) gives:

(Scala,WikipediaArticle(2,Scala and Java run on the JVM))
(Java,WikipediaArticle(2,Scala and Java run on the JVM))
(Scala,WikipediaArticle(3,Scala is not purely functional))

I have two questions:

Why can I not apply the groupByKey function? It is an RDD of tuples, and the first tuple entry is the key.
I don't see how to apply groupBy either. I thought I could do meinVal.groupBy(x => x._1), but that throws an error.

I notice that when I hover over "meinVal" in an IDE, it shows RDD[Object], whereas it should be RDD[(String, WikipediaArticle)]. I do not know how to get this information without the IDE. So it seems that the RDD contains just one big object. I just don't see why that is.

Anyone? Please?

Irene

You want `collect` instead of `map + filter` – Luis Miguel Mejía Suárez May 05 '22 at 17:42

score 0 · Answer 1 · answered May 06 '22 at 07:26

OK, thanks to this post https://stackoverflow.com/a/29426336/909909 I figured it out. The problem was not the nested flatMap-map construct but the condition inside the map. My code returned None when the condition was not met. Since None is not a tuple, the two branches have no common tuple type, so I get an RDD[Object] and therefore cannot use groupByKey. To solve this, I wrap the tuple in Some, so that both branches are Options, and then flatten the result to get rid of the Option wrappers and the Nones again.

val meinVal = rdd.flatMap(article => langs.map(lang => if (article.mentionsLanguage(lang)) Some(Tuple2(lang, article)) else None).flatten)

You can use `flatMap` instead of `map` + `flatten` or you can just use `collect` to avoid the need of `Option` — Luis Miguel Mejía Suárez, May 06 '22 at 14:15
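
The `collect` suggestion from the comments can be sketched with plain Scala collections, since the type inference is the same as with an RDD. This is only an illustration: the `WikipediaArticle` definition and `mentionsLanguage` below are assumptions made so the example runs on its own, and `FlatMapDemo`, `untyped`, `pairs` and `grouped` are hypothetical names. With Spark, the same lambda could be passed to `rdd.flatMap`.

```scala
// Assumed stand-in for the article class from the question.
final case class WikipediaArticle(title: String, text: String) {
  // Assumption: an article mentions a language if the name appears as a word.
  def mentionsLanguage(lang: String): Boolean = text.split(' ').contains(lang)
}

object FlatMapDemo {
  val langs: List[String] = List("Scala", "Java")

  val articles: List[WikipediaArticle] = List(
    WikipediaArticle("1", "Groovy is pretty interesting"),
    WikipediaArticle("2", "Scala and Java run on the JVM"),
    WikipediaArticle("3", "Scala is not purely functional")
  )

  // The original problem: one branch yields a tuple, the other yields None.
  // The element type is inferred as a common supertype of both, not a tuple,
  // so key/value operations are not available on the result.
  val untyped = articles.flatMap { article =>
    langs.map(lang => if (article.mentionsLanguage(lang)) (lang, article) else None)
  }

  // collect takes a partial function: it keeps only the languages that match
  // and maps them in the same step, so no Option or filter is needed.
  val pairs: List[(String, WikipediaArticle)] =
    articles.flatMap { article =>
      langs.collect { case lang if article.mentionsLanguage(lang) => (lang, article) }
    }

  // With real tuples, grouping by the first element works as expected.
  val grouped: Map[String, List[WikipediaArticle]] =
    pairs.groupBy(_._1).map { case (lang, ps) => lang -> ps.map(_._2) }

  def main(args: Array[String]): Unit =
    grouped.foreach(println)
}
```

Annotating the val with the type you expect, as with `pairs` above, is also one way to check the type without an IDE: if a branch returns `None` instead of a tuple, the compiler reports a type mismatch at that branch.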
