I have an RDD that I want to group by key, but it just doesn't work. I am a Scala and Spark beginner. I have the following RDD:


rdd: RDD[WikipediaArticle]

val meinVal = rdd.flatMap(article => langs.map(lang => { if (article.mentionsLanguage(lang)) { Tuple2(lang, article) } else { None } })).filter(_ != None)

meinVal.collect.foreach(println) gives:

(Scala,WikipediaArticle(2,Scala and Java run on the JVM))
(Java,WikipediaArticle(2,Scala and Java run on the JVM))
(Scala,WikipediaArticle(3,Scala is not purely functional))

I have two questions:

  1. Why can I not apply the groupByKey function? The RDD contains tuples, and the first entry of each tuple is the key.

  2. I don't see how to apply groupBy either. I thought I could do meinVal.groupBy(x => x._1), but that throws an error.

I notice that when I hover over "meinVal" in an IDE, it shows the type RDD[Object], whereas it should be RDD[(String, WikipediaArticle)]. I do not know how to get this information without the IDE. So it seems that the RDD contains just one big object, but I don't see why that is.

Anyone? Please?

Irene

Sallos

1 Answer

OK, thanks to this post https://stackoverflow.com/a/29426336/909909 I figured it out. The problem was not the nested flatMap/map construct but the condition inside the map. My code returned None when the condition was not met. Since None is not a tuple, the two branches of the if/else have no common tuple type, so I got an RDD[Object] and could not use groupByKey. To solve this, I wrap the tuple in Some so that both branches return an Option, and then flatten to drop the Nones and unwrap the tuples again.

val meinVal = rdd.flatMap(article => langs.map(lang => if (article.mentionsLanguage(lang)) Some((lang, article)) else None).flatten)
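
The typing issue does not depend on Spark, so it can be reproduced with plain Scala collections. The following is only a minimal sketch: the WikipediaArticle case class, its mentionsLanguage implementation, and the sample data are simplified stand-ins I assumed for illustration, not the real classes. It shows the Option-plus-flatten fix next to an alternative, a for-comprehension with a guard, which avoids the None branch entirely.

```scala
// Simplified stand-in for the WikipediaArticle class from the question (assumed shape).
case class WikipediaArticle(title: String, text: String) {
  // Assumed implementation: true if the text contains the language as a whole word.
  def mentionsLanguage(lang: String): Boolean = text.split(' ').contains(lang)
}

object GroupByLanguage {
  // Sample data, assumed for illustration only.
  val langs: List[String] = List("Scala", "Java")
  val articles: List[WikipediaArticle] = List(
    WikipediaArticle("2", "Scala and Java run on the JVM"),
    WikipediaArticle("3", "Scala is not purely functional")
  )

  // Both branches return an Option, so the element type stays a pair;
  // flatten then drops the Nones and unwraps the Somes.
  val viaOption: List[(String, WikipediaArticle)] =
    articles.flatMap(article =>
      langs.map(lang =>
        if (article.mentionsLanguage(lang)) Some((lang, article)) else None
      ).flatten
    )

  // Alternative: a guard filters before the pair is built, so no None is needed.
  val viaGuard: List[(String, WikipediaArticle)] =
    for {
      article <- articles
      lang <- langs
      if article.mentionsLanguage(lang)
    } yield (lang, article)

  // With a proper pair type, grouping by the first tuple entry compiles.
  val grouped: Map[String, List[WikipediaArticle]] =
    viaGuard.groupBy(_._1).map { case (lang, pairs) => lang -> pairs.map(_._2) }

  def main(args: Array[String]): Unit =
    grouped.foreach(println)
}
```

The explicit type annotations also answer the "how do I see the type without an IDE" part: if the inferred type is not the annotated one, the compiler reports the mismatch. On an RDD the same flatMap body should apply, and once the element type is (String, WikipediaArticle), groupByKey becomes available.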
Sallos