I am trying to write streaming data to Neo4j using Spark and am having some problems (I am very new to Spark).
I have set up a stream of word counts and can write it to Postgres using a custom ForeachWriter, as in the example here. So I think I understand the basic flow.
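For context, the Postgres version that works for me looks roughly like this (trimmed down; the JDBC URL, credentials and word_counts table are placeholders):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

class PostgresSink extends ForeachWriter[Row] {

  var connection: Connection = _
  var statement: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    // Runs on the executor, once per partition, so the connection
    // is created where it is used rather than on the driver
    connection = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/test", "user", "pw")
    statement = connection.prepareStatement(
      "INSERT INTO word_counts (word, count) VALUES (?, ?)")
    true
  }

  override def process(value: Row): Unit = {
    statement.setString(1, value.getString(0))
    statement.setLong(2, value.getLong(1))
    statement.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}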
I have then tried to replicate this and send the data to Neo4j instead, using the neo4j-spark-connector. I am able to send data to Neo4j using the example in the Zeppelin notebook here. So I've tried to transfer this code across to the ForeachWriter, but I've hit a problem: the sparkContext is not available in the ForeachWriter, and from what I have read it shouldn't be passed in, because it lives on the driver while the foreach code runs on the executors. Can anyone help with what I should do in this situation? I've put a sketch of what I suspect the workaround might look like at the end of this post.
Sink.scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime

val spark = SparkSession
  .builder()
  .appName("Neo4jSparkConnector")
  .config("spark.neo4j.bolt.url", "bolt://hdp1:7687")
  .config("spark.neo4j.bolt.password", "pw")
  .getOrCreate()

import spark.implicits._

// Socket source: one line of text per record
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
wordCounts.printSchema()

val writer = new Neo4jSink()

val query = wordCounts
  .writeStream
  .foreach(writer)
  // Streaming aggregations don't support "append" without a watermark,
  // so the counts are emitted in "complete" mode
  .outputMode("complete")
  .trigger(ProcessingTime("25 seconds"))
  .start()

query.awaitTermination()
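(To feed the socket source while testing, I run nc -lk 9999 and type words into it, as in the structured streaming word count example.)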
Neo4jSink.scala:
import org.apache.spark.sql.{ForeachWriter, Row}
import org.neo4j.spark.Neo4jDataFrame

class Neo4jSink() extends ForeachWriter[Row] {

  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  override def process(value: Row): Unit = {
    val word = ("Word", Seq("value"))
    val word_count = ("WORD_COUNT", Seq.empty)
    val count = ("Count", Seq("count"))
    // Problem: sparkContext is not in scope here, and process() runs
    // on the executors, so I can't just pass the driver-side context in
    Neo4jDataFrame.mergeEdgeList(sparkContext, value, word, word_count, count)
  }

  override def close(errorOrNull: Throwable): Unit = {
  }
}
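Is the right approach to bypass Neo4jDataFrame entirely and open a Bolt session per partition with the plain Neo4j Java driver, the same way the Postgres writer opens its JDBC connection? Something like the following untested sketch is what I have in mind (the Cypher, the MERGE keys and the 1.x driver API are my guesses):

import org.apache.spark.sql.{ForeachWriter, Row}
import org.neo4j.driver.v1.{AuthTokens, Driver, GraphDatabase, Session}
import scala.collection.JavaConverters._

class Neo4jCypherSink extends ForeachWriter[Row] {

  var driver: Driver = _
  var session: Session = _

  override def open(partitionId: Long, version: Long): Boolean = {
    // Runs on the executor, so the connection is created where it is used
    driver = GraphDatabase.driver("bolt://hdp1:7687", AuthTokens.basic("neo4j", "pw"))
    session = driver.session()
    true
  }

  override def process(value: Row): Unit = {
    val params = Map[String, AnyRef](
      "word" -> value.getString(0),
      "count" -> Long.box(value.getLong(1))).asJava
    session.run(
      "MERGE (w:Word {value: $word}) " +
      "MERGE (c:Count {count: $count}) " +
      "MERGE (w)-[:WORD_COUNT]->(c)", params)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (driver != null) driver.close()
  }
}

If that per-partition-connection pattern is the intended approach, I'd appreciate confirmation, or a pointer to whatever the connector itself expects here.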