I know Spark will not allow you to use functions that generate RDDs inside `map` or any of its variants. Is there a workaround for this? For instance, can I perform a standard looping iteration over all the RDD entries in a partition? (Is there a method to convert an RDD to a list on each node, so that each node holds a list of the entries it was carrying?)
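One workaround along the lines the question asks about is `RDD.mapPartitions` (or `RDD.glom`), which hands each task a plain iterator over its partition; you can materialize that iterator as a local Python list and loop over it with ordinary code, as long as you don't create RDDs inside. Since the per-partition semantics are easy to model without a cluster, here is a pure-Python sketch of what `mapPartitions` does (the real call would be `rdd.mapPartitions(per_partition)`, and `rdd.glom()` yields one list per partition):

```python
# Model an RDD as a list of partitions; mapPartitions calls the
# supplied function once per partition with an iterator over it.
def map_partitions(partitions, f):
    return [list(f(iter(part))) for part in partitions]

def per_partition(it):
    data = list(it)   # each task sees its partition as a plain list
    yield sum(data)   # yield any number of output records

partitions = [[0, 1, 2, 3], [4, 5, 6, 7]]  # what glom() would return
result = map_partitions(partitions, per_partition)
print(result)  # [[6], [22]]
```

The key point is that inside `per_partition` you are in ordinary local Python, so standard loops and list operations are fine; only Spark-context operations (creating or acting on RDDs) are off-limits there.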

I'm trying to do some graph work with GraphFrames in PySpark, and what I want is currently not possible there.

Dylan Lawrence
  • @zero323 I don't think so since it still returns an RDD. – Dylan Lawrence Mar 04 '17 at 17:12
  • In that case I don't understand what you're trying to achieve :) – zero323 Mar 04 '17 at 17:14
  • @zero323 I have a list of all vertices in pairs. (i.e. cartesian product) and I want to find the shortest path between all vertices. I've read that I can use Floyd-Warshall for this but in graphframes even the `find` method returns dataframes so I'm unsure how to best iterate the rdd of pairs. – Dylan Lawrence Mar 04 '17 at 22:01
  • How large is the graph? – zero323 Mar 04 '17 at 23:10
  • @zero323 Too big for me to want to coalesce or put it on the driver. – Dylan Lawrence Mar 04 '17 at 23:15
  • Is it fully connected? – zero323 Mar 04 '17 at 23:33
  • @zero323 I'm unsure, I believe that mathematically I can not guarantee connectivity. – Dylan Lawrence Mar 04 '17 at 23:35
  • I've been thinking more about identifying connected components to reduce the problem, so the opposite would be more useful. Anyway... Floyd–Warshall can be implemented with message passing, so you should be able to use the Pregel API, but I don't think that GraphFrames will be very useful for you here, unless you want to iterate vertex by vertex. – zero323 Mar 05 '17 at 00:42
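For reference, the Floyd–Warshall algorithm discussed in the comments is straightforward to state. A minimal single-machine sketch in plain Python is below (a distributed version would partition the work or use message passing as suggested above; the function name and edge-dict representation are illustrative, not from any library):

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest paths on n vertices.

    edges: dict mapping (u, v) -> weight for directed edges.
    Returns an n x n distance matrix (math.inf = unreachable).
    """
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
    # Relax paths through each intermediate vertex k in turn.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

d = floyd_warshall(3, {(0, 1): 1, (1, 2): 2, (0, 2): 10})
# d[0][2] == 3, via 0 -> 1 -> 2
```

The O(n³) time and O(n²) matrix make this impractical for a graph too large for the driver, which is why the comments steer toward connected components and Pregel-style message passing instead.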

0 Answers