
I am working with a large graph (3M nodes and 1B relations between them). There are two types of nodes, categories and users, and I want to use Spark to analyze the graph data, for example to perform path analysis between categories.

I have the following questions, if anyone can help:

1) Do I need to load the whole graph into Spark to do the analysis? I tried to load the node and edge lists into Spark GraphFrames using the following Scala code:

val nodesQuery = "MATCH (n:category) RETURN id(n) AS id, n.userid AS user_id, n.catid AS cat_id UNION ALL MATCH (n:user) RETURN id(n) AS id, n.userid AS user_id, n.catid AS cat_id"
val relsQuery = "MATCH (p:category) OPTIONAL MATCH (p:category)-[r]-(n:user) RETURN id(p) AS src, id(n) AS dst, type(r) AS value"
val graphFrame = neo.nodes(nodesQuery, Map.empty).rels(relsQuery, Map.empty).loadGraphFrame

The first issue with this is that I get null values for the user nodes in the node list, and a memory overflow also occurs. Any suggestions on this?

2) The reason I decided to use GraphFrames is that its queries are supposedly optimized, but with RDDs I can load the data in batches. Which is the better approach here?
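For reference, here is a minimal sketch of the batching idea with DataFrames rather than RDDs, so the result can still feed a GraphFrame. The `loadBatch` helper, the paging via `SKIP`/`LIMIT`, and the batch size are all assumptions for illustration, not part of the original setup:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: pull the node list out of Neo4j in fixed-size pages and union
// the pages into one DataFrame, instead of materializing everything in a
// single query. `loadBatch` is a hypothetical wrapper around whatever
// Neo4j reader is in use; it is expected to append
// " SKIP <skip> LIMIT <limit>" to the node Cypher query.
def loadNodesInBatches(loadBatch: (Long, Long) => DataFrame,
                       totalNodes: Long,
                       batchSize: Long): DataFrame = {
  val batches = (0L until totalNodes by batchSize).map { skip =>
    loadBatch(skip, batchSize) // one page of nodes per call
  }
  batches.reduce(_ union _) // lazy union; Spark evaluates pages on demand
}
```

The union is lazy, so Spark only pulls each page when the downstream job actually needs it, which keeps peak memory per executor closer to one batch than to the whole node list.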

3) What are possible approaches to perform distance analysis on this data (I need to measure the distance between two categories), equivalent to Cypher like the following:

MATCH path=(cat1:category {catid:'1900'})-[rel1:INTERESTED_IN]-(user1:user)-[rel2:INTERESTED_IN*2..3]-(cat2:category {catid:'1700'})
RETURN cat1, path, cat2, rel1
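One way to express this kind of bounded-length path search on a GraphFrame is its built-in BFS. A sketch, assuming the graph has already been loaded as `g` and that the column names `cat_id` (from the node query above) and `relationship` (for the edge type) exist; the `maxPathLength` value is likewise an assumption matching the `*2..3` bound plus the first hop:

```scala
import org.graphframes.GraphFrame

// Sketch: breadth-first search between the two category nodes, restricted
// to INTERESTED_IN edges, with a cap on path length. Returns a DataFrame
// with one row per discovered path.
val paths = g.bfs
  .fromExpr("cat_id = '1900'")
  .toExpr("cat_id = '1700'")
  .edgeFilter("relationship = 'INTERESTED_IN'")
  .maxPathLength(4) // category -> user -> ... -> category
  .run()

paths.show(truncate = false)
```

Unlike the Cypher query, BFS returns shortest paths only, so paths of length 2 through 4 are found but longer detours between the same endpoints are not enumerated.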

4) Will using message passing via AggregateMessages help? Would I still need to load the whole graph into Spark?
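For context on what AggregateMessages does, here is a minimal sketch computing per-node degree; it operates on whatever GraphFrame `g` it is given, so by itself it does not avoid loading that (sub)graph into Spark first. The column names are assumptions carried over from the queries above:

```scala
import org.graphframes.GraphFrame
import org.graphframes.lib.AggregateMessages
import org.apache.spark.sql.functions.{lit, sum}

val AM = AggregateMessages

// Sketch: each edge sends the message `1` to both of its endpoints; each
// node then sums the messages it received, yielding its degree. The same
// pattern (send a value, aggregate at the node) underlies iterative
// distance propagation.
val degrees = g.aggregateMessages
  .sendToDst(lit(1))
  .sendToSrc(lit(1))
  .agg(sum(AM.msg).alias("degree")) // DataFrame: (id, degree)
```

For an actual distance computation, the message would carry a candidate distance and the aggregation would take the minimum, repeated in a loop until no node's distance changes.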

Richard Telford

0 Answers