
I have a users: RDD[(Long, Vertex)] collection of users. I want to create links between my Vertex objects. The rule is: if two Vertex objects have the same value in a selected property (call it prop1), then a link exists between them.

My problem is how to perform this check for every pair in the same collection. If I do:

val rels = users.map(
  x => users.map(y => if(x._2.prop1 == y._2.prop1){(x._1, y._1)}))

I get back an RDD[RDD[Any]] and not an RDD[(Long, Long)] as expected for the Graph to work.

  • You have to write an else block for your if so that the resulting type is (Long, Long) - that's one of the issues – Nyavro Nov 19 '15 at 10:07
  • do you know another way to link together people that share a common attribute? – user299791 Nov 19 '15 at 10:10
  • sorry but I don't get your previous comment, there is no "else", if they don't match, no link should be returned – user299791 Nov 19 '15 at 10:13
  • If you don't specify an else statement the result type is going to be Any, hence the RDD[RDD[Any]] you mentioned. So if you don't have an else statement you'd be better off filtering the collection by equal property – Nyavro Nov 19 '15 at 10:17
  • I would group the collection of users by prop1 and get the desired pairs from the groups – Nyavro Nov 19 '15 at 10:19
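
A minimal sketch of the grouping approach suggested in the comment above, assuming a hypothetical Vertex case class with a prop1 field (users stands for the existing RDD[(Long, Vertex)]):

import org.apache.spark.rdd.RDD

// Hypothetical Vertex type; only the prop1 field matters for linking.
case class Vertex(prop1: String)

val users: RDD[(Long, Vertex)] = ???  // the existing collection of users

// Key each user id by prop1, group by that key, then emit every ordered
// pair of distinct ids within a group.
val rels: RDD[(Long, Long)] = users
  .map { case (id, v) => (v.prop1, id) }
  .groupByKey()
  .flatMap { case (_, ids) =>
    for (a <- ids; b <- ids if a != b) yield (a, b)
  }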

1 Answer


First of all, you cannot start an action or a transformation from another action or transformation, not to mention create nested RDDs. So it is simply impossible for you to get an RDD[RDD[Any]].

What you need here is most likely a simple join, roughly equivalent to something like this, where T is the type of prop1:

val pairs: RDD[(T, Long)] = users.map{ case (id, v) => (v.prop1, id) }
val links: RDD[(Long, Long)] = pairs
  .join(pairs)  // join by a common property, equivalent to INNER JOIN in SQL
  .values  // drop properties
  .filter{ case (v1, v2) => v1 != v2 }  // filter self-links
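
As a quick sanity check, here is a self-contained sketch of the same join on toy data; the Vertex case class and the sample values are made up for illustration and sc is an existing SparkContext:

case class Vertex(prop1: String)

val users = sc.parallelize(Seq(
  (1L, Vertex("scala")),
  (2L, Vertex("scala")),
  (3L, Vertex("python"))
))

val pairs = users.map { case (id, v) => (v.prop1, id) }
val links = pairs
  .join(pairs)   // keeps only ids that share a prop1 value
  .values
  .filter { case (v1, v2) => v1 != v2 }

links.collect()  // Array((1,2), (2,1)) -- each matching couple appears in both directions

Each couple shows up twice (once per direction); if you only want one link per pair, you can tighten the filter to v1 < v2.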
  • thanks for the answer, I reported RDD[RDD[Any]] as a consequence of an Eclipse error warning: type mismatch; found: org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD[Any]] required: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[?]] Error occurred in an application involving default arguments. – user299791 Nov 19 '15 at 10:29
  • Yeah, never believe IDEs :) – zero323 Nov 19 '15 at 10:30
  • so now I get that you were not being ironic in your message, apologies for over-reacting... :) – user299791 Nov 19 '15 at 10:32
  • Don't worry... Still, it is Spark 101 and it is not only about matching types. If it wasn't for that, you could simply replace the outer `map` with a `flatMap` and the internal `map` with a `filter`. – zero323 Nov 19 '15 at 10:37
  • @digitaldust Thanks, fixed. – zero323 Nov 19 '15 at 11:00
  • I just realised that the outcome should be in the form Edge(3L, 7L, "collab") so something like RDD[Edge[]] to feed the graphx constructor... I will accept anyway your answer because it's perfect for my actual question – user299791 Nov 19 '15 at 11:07
  • You can simply add another `map` after the filter. If you want to keep the property and use it as an edge label, then remove the `values` call and filter / map directly. – zero323 Nov 19 '15 at 11:13
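
A minimal sketch of that last suggestion, assuming links is the RDD[(Long, Long)] produced by the answer and "collab" is the constant edge label mentioned in the earlier comment:

import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD

// Wrap each id pair in a GraphX Edge with a constant attribute.
val edges: RDD[Edge[String]] = links.map { case (src, dst) => Edge(src, dst, "collab") }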