
I have around 1 TB of data, stored as vertex and edge files, to be loaded into a Spark GraphFrame so I can build a graph and run motif (pattern-finding) queries on it.

For every batch, this 1 TB of vertex and edge data needs to be loaded into a GraphFrame to create the graph and query it.

The problem I have is that creating the graph is slow. So I want to store the created graph to S3/disk, so that next time I can load the graph directly and run queries on it, which will be fast. Is there any way to do this, i.e. create the huge graph with GraphFrames, store it on disk, and from then on load the graph straight into a GraphFrame and query it?
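For context, my per-batch flow looks roughly like this (a sketch; the paths, formats, and the two-hop pattern are placeholders, not my actual data or query):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("batch-motifs").getOrCreate()

# Vertex data must carry an "id" column; edge data needs "src" and "dst".
vertices = spark.read.parquet("s3://my-bucket/batch/vertices")  # placeholder path
edges = spark.read.parquet("s3://my-bucket/batch/edges")        # placeholder path

g = GraphFrame(vertices, edges)

# Example motif (pattern-finding) query: every directed two-hop path a -> b -> c.
paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")
paths.show()
```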

AbhiK
  • You mean some alternative to saving vertices and edges as described in the docs? https://graphframes.github.io/graphframes/docs/_site/user-guide.html#saving-and-loading-graphframes – mazaneicha Jun 21 '20 at 21:34
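For reference, the approach in that doc boils down to persisting the vertex and edge DataFrames and rebuilding the GraphFrame from them on the next run, roughly like this (a sketch; paths are placeholders):

```python
# Save: GraphFrames has no single "save graph" call, so persist the two
# DataFrames that make up the graph (Parquet shown; any DataFrame sink works).
g.vertices.write.mode("overwrite").parquet("s3://my-bucket/graph/vertices")
g.edges.write.mode("overwrite").parquet("s3://my-bucket/graph/edges")

# Load: on the next run, rebuild the GraphFrame straight from the saved files.
from graphframes import GraphFrame

v = spark.read.parquet("s3://my-bucket/graph/vertices")
e = spark.read.parquet("s3://my-bucket/graph/edges")
same_graph = GraphFrame(v, e)
```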

1 Answer


Are you sure that the part that is slow is the creation of the GraphFrame?

From my experience, the creation of the GraphFrame object is not really the slow part. The calculation of the motifs, however, is really slow, particularly if you have to calculate patterns of length greater than 10. The reason is that it performs self-joins on the DataFrames created under the hood, as you can see from https://www.waitingforcode.com/apache-spark-graphframes/motifs-finding-graphframes/read.
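To make the cost concrete: every extra edge in the pattern adds another term, and under the hood roughly another self-join on the edge DataFrame. A sketch, using a hypothetical helper to build the pattern string:

```python
def chain_pattern(length):
    # Hypothetical helper: build a motif string for a directed path of the
    # given length, e.g. length=2 -> "(v0)-[e0]->(v1); (v1)-[e1]->(v2)".
    terms = [f"(v{i})-[e{i}]->(v{i + 1})" for i in range(length)]
    return "; ".join(terms)

# Each term beyond the first costs roughly one more join, so runtime climbs
# steeply as the pattern gets longer.
long_paths = g.find(chain_pattern(10))
```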

Hope this helps

Oscar Lopez M.
  • I have read that while loading a graph from vertex and edge files, GraphX/GraphFrames internally creates triplets, and only then proceeds to query execution. The larger the dataset, the more time is required to create the triplets. So I thought that if I could store the huge graph and load it later just for queries, I might save some time. – AbhiK Jun 22 '20 at 07:37
  • What you could do is store the motif outputs from GraphFrames and use the last step as input for a second round of motif calculations from that stage (see the sketch below this thread). But this would be very manual and not really efficient. So I think it would be better to use a graph database if the length is high, or other packages like NetworkX (Python-based, but it does not use self-joins like GraphFrames) – Oscar Lopez M. Jun 22 '20 at 10:58
  • @Oscar Lopez M. I agree with you; I have found that GraphFrames is not a good option for large datasets. I am looking into graph databases now. – AbhiK Jun 23 '20 at 16:27
  • @AbhiK If you think that's a good answer, can you please mark it as the correct answer? Thanks a lot! – Oscar Lopez M. Jul 13 '20 at 11:16
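A rough illustration of the staged approach from the comment above (a sketch; it assumes an existing SparkSession `spark` and GraphFrame `g`, the paths and column names are made up, and joining on endpoints loses the intermediate vertices):

```python
from pyspark.sql.functions import col

pat2 = "(v0)-[e0]->(v1); (v1)-[e1]->(v2)"  # all directed length-2 paths

# Stage 1: run the short motif once and persist only the path endpoints.
stage1 = g.find(pat2).select(col("v0.id").alias("start"), col("v2.id").alias("mid"))
stage1.write.mode("overwrite").parquet("s3://my-bucket/motifs/len2")

# Stage 2: instead of one length-4 motif (and its extra self-joins), reload the
# stored endpoints and join two length-2 stages on the shared midpoint.
prev = spark.read.parquet("s3://my-bucket/motifs/len2")
nxt = g.find(pat2).select(col("v0.id").alias("mid"), col("v2.id").alias("end"))
len4_endpoints = prev.join(nxt, on="mid")  # endpoints of length-4 paths only
```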