
I have the following problem. I have a dataframe "vert" in Spark, consisting of three columns: Origin (String), Destination (String), Distance (Integer). So it is simply data about flights between different cities. For example, it could look like this:

Chicago Houston 670
London Chicago 1200
...

I want to create the corresponding graph in GraphX, and I want to take the distances as edge attributes of the graph. So first I have to define the edges RDD. I found the following way to do this:

val ed = vert.rdd
  .map(x => ((MurmurHash.stringHash(x(0).toString), MurmurHash.stringHash(x(1).toString)), 1))
  .reduceByKey(_+_)
  .map(x => Edge(x._1._1, x._1._2, x._2))

Unfortunately, this command only takes the columns Origin and Destination into account and ignores the column Distance, so I have no information about the distances in the RDD "ed". How do I have to change the command so that the distances end up in the RDD as well?
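A plain-Scala sketch of what the pipeline above actually computes (using `groupBy` and `sum` to stand in for Spark's `reduceByKey`, so no cluster is needed to follow along):

```scala
import scala.util.hashing.MurmurHash3

// The sample rows from above: (Origin, Destination, Distance).
val rows = Seq(("Chicago", "Houston", 670), ("London", "Chicago", 1200))

// Mirror the RDD pipeline: every row becomes ((srcHash, dstHash), 1),
// and the reduce sums the 1s -- so the final value per key is a row count,
// and the Distance column never makes it into the result.
val counted = rows
  .map { case (o, d, _) => ((MurmurHash3.stringHash(o), MurmurHash3.stringHash(d)), 1) }
  .groupBy(_._1)
  .map { case (key, vs) => (key, vs.map(_._2).sum) }
```

Since the dataframe has no duplicate origin/destination pairs, every count comes out as 1, which is exactly the constant attribute that ends up on each `Edge`.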

Sorry if it is a stupid question and thanks in advance.

Jacek Laskowski
Logic_Problem_42
    Do you need a `reduceByKey` here? I.e. does the dataframe contain the same pair of cities multiple times (seems unlikely). – Shaido May 02 '18 at 09:32
  • No, I have no duplicate rows. Is the reduceByKey only for the purpose of eliminating duplicates here? I thought that reduceByKey makes some aggregation. – Logic_Problem_42 May 02 '18 at 09:36
  • Oh, I have actually found the way. I can simply replace 1 with x(2) in my command. – Logic_Problem_42 May 02 '18 at 09:40
  • Yes, it will aggregate. But your current code creates a key based on the origin and destination columns, and the `reduceByKey` then adds up the number of rows that have the same cities (since you have 1 as the value and `_+_` in the reduce). This effectively gives you one row per origin/destination pair, with a value representing the number of rows in the original dataframe in which that pair occurs. – Shaido May 02 '18 at 09:42
  • You can simply do: `map(x => Edge(MurmurHash.stringHash(x(0).toString), MurmurHash.stringHash(x(1).toString), x(2)))` directly for the same result. – Shaido May 02 '18 at 09:43
  • By the way, the Edge method seems to accept only one attribute. Do you know how I can define several attributes? Or can I simply use a list of attributes as an argument of Edge? – Logic_Problem_42 May 02 '18 at 09:54
  • The easiest would be to use a tuple or a case class; you can see a more detailed answer of mine to this question here: https://stackoverflow.com/questions/46680128/spark-graphx-add-multiple-edge-weights/46680501#46680501 – Shaido May 02 '18 at 09:58
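Putting the comments together, a sketch of the fix (assuming Spark with GraphX on the classpath; `MurmurHash3` from the Scala standard library stands in for the deprecated `MurmurHash`, and the `Route` case class is a hypothetical example for carrying several attributes):

```scala
import scala.util.hashing.MurmurHash3
import org.apache.spark.graphx.Edge

// Since there are no duplicate rows, the reduceByKey step can be dropped:
// carry the Distance column through as the edge attribute directly.
val ed = vert.rdd.map { x =>
  Edge(
    MurmurHash3.stringHash(x(0).toString).toLong, // source vertex id
    MurmurHash3.stringHash(x(1).toString).toLong, // destination vertex id
    x(2).toString.toInt                           // distance as the edge attribute
  )
}

// For several attributes per edge, wrap them in a case class (hypothetical):
case class Route(distance: Int, airline: String)
// val ed = vert.rdd.map(x => Edge(..., ..., Route(x(2).toString.toInt, x(3).toString)))
```

The hash is deterministic, so the same city name always maps to the same vertex id, which is what lets the edge endpoints line up with a separately built vertex RDD.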

0 Answers