I am working with a large graph (3M nodes and 1B relationships between them). There are two types of nodes, categories and users, and I want to use Spark to analyze the graph data, for example to perform path analysis between categories.
I have the following questions, if anyone can help:
1) Do I need to load the whole graph into Spark to do the analysis? I tried to load the node and edge lists into Spark GraphFrames using the following Scala code:
val nodesQuery = "MATCH (n:category) RETURN id(n) AS id, n.userid AS user_id, n.catid AS cat_id UNION ALL MATCH (n:user) RETURN id(n) AS id, n.userid AS user_id, n.catid AS cat_id"
val relsQuery = "MATCH (p:category) OPTIONAL MATCH (p:category)-[r]-(n:user) RETURN id(p) AS src, id(n) AS dst, type(r) AS value"
val graphFrame = neo.nodes(nodesQuery, Map.empty).rels(relsQuery, Map.empty).loadGraphFrame
The first issue is that I get null values for the user nodes in the nodes list, and a memory overflow also occurs. Any suggestions on this?
2) The reason I decided to use GraphFrames is that the queries are supposedly optimized, but with RDDs I can load the data in batches. Is GraphFrames still the right choice for a graph of this size?
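For comparison, this is roughly how I understand the RDD-based batch loading would look with the neo4j-spark-connector. It is only a sketch; the partition and batch sizes are placeholders I picked, not tuned values:

```scala
import org.neo4j.spark._

// Assumes `sc` is an active SparkContext already configured with the
// Neo4j connection settings (spark.neo4j.bolt.url, user, password).
val neo = Neo4j(sc)

// Read the relationships as a row RDD in partitioned batches instead of
// one single pull. partitions()/batch() values are illustrative only.
val relsRdd = neo.cypher(
    "MATCH (p:category)-[r]-(n:user) " +
    "RETURN id(p) AS src, id(n) AS dst, type(r) AS value")
  .partitions(100)
  .batch(10000000)
  .loadRowRdd
```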
3) What would you suggest for performing distance analysis on this data? I need to measure the distance between two categories, as in the following Cypher query:
MATCH path = (cat1:category {catid:'1900'})-[rel1:INTERESTED_IN]-(user1:user)-[rel2:INTERESTED_IN*2..3]-(cat2:category {catid:'1700'}) RETURN cat1, path, cat2, rel1
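On the GraphFrames side, I believe the closest equivalent to this query would be a bounded breadth-first search. A minimal sketch, assuming the vertex DataFrame keeps the `cat_id` column from the node query above:

```scala
// Assumes `graphFrame` was built as shown earlier, with a `cat_id`
// column on the vertices. BFS finds shortest paths between the two
// categories; maxPathLength bounds the search so it stays tractable.
val paths = graphFrame.bfs
  .fromExpr("cat_id = '1900'")
  .toExpr("cat_id = '1700'")
  .maxPathLength(4) // roughly mirrors the *2..3 bound plus the first hop
  .run()

paths.show()
```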
4) Would using message passing via aggregateMessages help? Would I still need to load the whole graph into Spark?
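For context, this is my understanding of what an aggregateMessages pass looks like; it is a toy example (computing each vertex's degree with one round of message passing), and as far as I can tell it still operates on the full GraphFrame:

```scala
import org.apache.spark.sql.functions.{lit, sum}
import org.graphframes.lib.AggregateMessages

// Illustrative only: each edge sends the value 1 to both endpoints,
// and summing the received messages yields the vertex degree.
// Assumes `graphFrame` from the loading code above.
val AM = AggregateMessages
val degrees = graphFrame.aggregateMessages
  .sendToSrc(lit(1))
  .sendToDst(lit(1))
  .agg(sum(AM.msg).alias("degree"))
```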