
My RDD contains biological data: protein names and the degree of similarity between them. I would like to create a graph in which the vertices are proteins and the edges carry the similarity values. Here is my RDD:

+-------------+------------+------------+
|   Protein1  |  Protein2  | Similarity |
+-------------+------------+------------+
|    P28469   |   Q70UP5   | 0.11111111 |
|    O45687   |   P00325   |    1.0     |
|    A7ME43   |   Q5HG16   |    0.6     |
|    A4VJT7   |   Q9LD43   |    1.0     |
|    P31937   |   Q64415   | 0.07692308 |
|    A1VAA0   |   Q9L298   |    1.0     |
|    B8DG74   |   Q6MT35   |    1.0     |
+-------------+------------+------------+

Thank you!

amelie

1 Answer


This example uses different data, but the approach is the same: build a vertex DataFrame and an edge DataFrame (from a file in your case, of course), then adapt it to your data:

import org.graphframes.GraphFrame

// Vertex DataFrame: GraphFrame requires a column named "id"
val v = sqlContext.createDataFrame(List(
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
)).toDF("id", "name", "age")
// Edge DataFrame: GraphFrame requires columns named "src" and "dst"
val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(v, e) 

In your case:

// I remember your earlier question about distinct; it is used here so that each protein appears once as a vertex.
// You mention an RDD, although the data is displayed like a DataFrame; an RDD of tuples is assumed here.

// RDD of (protein1, protein2, similarity) tuples, simulating data read from a file
val rdd = sc.parallelize(Array(("p1", "p2", 1),
                               ("p1", "p3", 2),
                               ("p2", "p4", 3),
                               ("p5", "p6", 4)))

// Vertices: every protein from either column; GraphFrame requires the column name "id"
val v = rdd.map(x => x._1).union(rdd.map(x => x._2)).distinct.toDF("id")
// Edges: GraphFrame requires "src" and "dst"; "similarity" is kept as an edge attribute
val e = rdd.toDF("src", "dst", "similarity")

v.show(false)
e.show(false)

val g = GraphFrame(v, e)
thebluephantom
  • I already have my data in myrdd; I have to split it and define which parts are the vertices and which are the edges. That is different from what you wrote. – amelie Jul 01 '20 at 14:08
  • Different data. Just the principle. – thebluephantom Jul 01 '20 at 14:32
  • No, I mean the principle too. In your code you inserted the data by hand and used GraphFrame to build the graph; in my case the data comes from a CSV file that I convert into an RDD, and I am looking for a function I can use on it. – amelie Jul 01 '20 at 14:36
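
For the CSV case raised in the comments, one possible route is to skip the RDD and read the file straight into a DataFrame, then rename the columns to the names GraphFrames expects ("id" for vertices, "src" and "dst" for edges). The following is only a sketch for spark-shell or a notebook: the column names Protein1, Protein2 and Similarity are taken from the table in the question, the three in-memory lines stand in for the real file, and the app name is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("protein-graph").getOrCreate()
import spark.implicits._

// Stand-in for the CSV file (assumption: comma-separated with a header line).
// For a real file, pass its path to .csv(...) instead of this Dataset[String].
val lines = Seq(
  "Protein1,Protein2,Similarity",
  "P28469,Q70UP5,0.11111111",
  "O45687,P00325,1.0",
  "A7ME43,Q5HG16,0.6"
).toDS()

val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(lines)

// Edges: rename the protein columns to "src" and "dst";
// any extra column (here "similarity") becomes an edge attribute.
val edges = raw.select(
  col("Protein1").as("src"),
  col("Protein2").as("dst"),
  col("Similarity").cast("double").as("similarity"))

// Vertices: every distinct protein from either side, in a column named "id".
val vertices = edges.select(col("src").as("id"))
  .union(edges.select(col("dst").as("id")))
  .distinct()

val g = GraphFrame(vertices, edges)
g.vertices.show(false)
g.edges.show(false)
```

If the data is already an RDD of tuples, as in the answer, calling toDF("src", "dst", "similarity") on it gives the same edge DataFrame, and the vertex step stays unchanged.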