I have time series data currently stored as a graph in a Neo4j server instance, version 2.3.6 (so REST interface only, no Bolt), using a time tree structure similar to this. What I am trying to do is run some analytics on these time series in a distributed way using PySpark.
Now, I am aware of existing projects connecting Spark with Neo4j, in particular the ones listed here. The problem is that they focus on providing an interface to work with graphs. In my case the graph is not relevant: my Neo4j Cypher queries are meant to produce arrays of values, roughly as in the sketch below. Everything downstream is about handling these arrays as time series, not as graphs.
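For concreteness, this is the shape of call I mean, against the stock Neo4j 2.x transactional HTTP endpoint; the time-tree pattern, endpoint URL and credentials are placeholders, not my real setup:

```python
# Single-query sketch: one REST call returning a plain array of values.
# The Cypher pattern below is a hypothetical time-tree traversal.
import requests

query = ("MATCH (:Year {value: {year}})-[:CHILD*2]->()-[:VALUE]->(v) "
         "RETURN collect(v.value) AS series")
resp = requests.post(
    "http://localhost:7474/db/data/transaction/commit",
    json={"statements": [{"statement": query, "parameters": {"year": 2015}}]},
    auth=("neo4j", "password"),
)
series = resp.json()["results"][0]["data"][0]["row"][0]  # a plain Python list
```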
My question is:
has anybody successfully queried a REST-only Neo4j instance in parallel using PySpark, and if so, how did you do it?
The py2neo library seemed like a good candidate until I realized that the connection object cannot be shared across partitions (or if it can, I do not know how). Right now I am considering having each Spark partition run its own independent REST queries against the Neo4j server, along the lines of the sketch below, but I want to see how others may have solved this problem.
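A minimal sketch of that idea, assuming the Neo4j 2.3 transactional HTTP endpoint as above; the `fetch_series` helper, the time-tree Cypher query and the partitioning by year are all hypothetical, just to illustrate the per-partition-connection pattern:

```python
# Sketch only: each partition opens its own HTTP session on the executor,
# so nothing non-picklable has to cross the driver/executor boundary.
import requests
from pyspark import SparkContext

NEO4J_URL = "http://localhost:7474/db/data/transaction/commit"  # placeholder
AUTH = ("neo4j", "password")                                    # placeholder
CYPHER = ("MATCH (:Year {value: {year}})-[:CHILD*2]->()-[:VALUE]->(v) "
          "RETURN collect(v.value) AS series")

def fetch_series(years):
    # Created inside the executor, once per partition.
    session = requests.Session()
    session.auth = AUTH
    for year in years:
        payload = {"statements": [{"statement": CYPHER,
                                   "parameters": {"year": year}}]}
        resp = session.post(NEO4J_URL, json=payload)
        resp.raise_for_status()
        body = resp.json()
        if body["errors"]:  # Cypher errors come back with HTTP 200
            raise RuntimeError(body["errors"])
        for row in body["results"][0]["data"]:
            yield (year, row["row"][0])  # row[0] is the collected array

sc = SparkContext(appName="neo4j-rest-timeseries")
series = sc.parallelize(range(2010, 2016), numSlices=6).mapPartitions(fetch_series)
print(series.take(1))
```

The point of `mapPartitions` here is that the session is built on the executor, one per partition, which sidesteps shipping a py2neo connection object from the driver. I do not know whether this is the idiomatic solution, hence the question.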