
I am trying to write streaming data to Neo4j using Spark and am having some problems (I am very new to Spark).

I have tried setting up a stream of word counts and can write it to Postgres using a custom ForeachWriter, as in the example here. So I think I understand the basic flow.
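My working Postgres writer looks roughly like this (simplified, and the table name, columns and connection details here are placeholders rather than my real ones):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

class JdbcSink(url: String, user: String, pwd: String) extends ForeachWriter[Row] {

  var connection: Connection = _
  var statement: PreparedStatement = _

  def open(partitionId: Long, version: Long): Boolean = {
    // The connection is created here, i.e. on the executor, once per partition
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.prepareStatement(
      "INSERT INTO word_counts (word, count) VALUES (?, ?)")
    true
  }

  def process(value: Row): Unit = {
    statement.setString(1, value.getString(0))
    statement.setLong(2, value.getLong(1))
    statement.executeUpdate()
  }

  def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}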

I have then tried to replicate this and send the data to Neo4j instead, using the neo4j-spark-connector. I am able to send data to Neo4j using the example in the Zeppelin notebook here, so I tried to transfer that code across to the ForeachWriter. The problem is that the sparkContext is not available in the ForeachWriter, and from what I have read it shouldn't be passed in, because it lives on the driver while the foreach code runs on the executors. Can anyone help with what I should do in this situation?

Sink.scala:

val spark = SparkSession
  .builder()
  .appName("Neo4jSparkConnector")
  .config("spark.neo4j.bolt.url", "bolt://hdp1:7687")
  .config("spark.neo4j.bolt.password", "pw")
  .getOrCreate()

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

wordCounts.printSchema()

val writer = new Neo4jSink()

import org.apache.spark.sql.streaming.ProcessingTime

val query = wordCounts
  .writeStream
  .foreach(writer)
  .outputMode("update") // "append" fails with a streaming aggregation unless a watermark is set
  .trigger(ProcessingTime("25 seconds"))
  .start()

query.awaitTermination()
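(While testing I feed the socket source with netcat on port 9999, as in the standard Structured Streaming word count example.)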

Neo4jSink.scala:

import org.apache.spark.sql.{ForeachWriter, Row}
import org.neo4j.spark.Neo4jDataFrame

class Neo4jSink() extends ForeachWriter[Row] {

  def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  def process(value: Row): Unit = {

    val word = ("Word", Seq("value"))
    val word_count = ("WORD_COUNT", Seq.empty)
    val count = ("Count", Seq("count"))
    // This is the problem line: sparkContext is not in scope here, and from
    // what I have read it can't simply be passed in from the driver
    Neo4jDataFrame.mergeEdgeList(sparkContext, value, word, word_count, count)

  }

  def close(errorOrNull: Throwable): Unit = {
  }
}
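From what I can tell, the Postgres writer only works because it opens its JDBC connection inside open(), on the executor. Is the right approach here to do the same with the Neo4j Java driver and run the Cypher MERGE myself, instead of going through Neo4jDataFrame? Something like the sketch below is what I have in mind (untested; the Cypher is just my guess at the pattern mergeEdgeList would create, and it assumes the Neo4j Java driver is on the executor classpath):

import org.apache.spark.sql.{ForeachWriter, Row}
import org.neo4j.driver.v1.{AuthTokens, Driver, GraphDatabase, Session, Values}

class Neo4jCypherSink(uri: String, user: String, password: String) extends ForeachWriter[Row] {

  var driver: Driver = _
  var session: Session = _

  def open(partitionId: Long, version: Long): Boolean = {
    // Open the bolt connection on the executor, once per partition,
    // rather than trying to get at the sparkContext
    driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password))
    session = driver.session()
    true
  }

  def process(value: Row): Unit = {
    // MERGE the same (:Word)-[:WORD_COUNT]->(:Count) pattern as above
    session.run(
      "MERGE (w:Word {value: $value}) " +
        "MERGE (c:Count {count: $count}) " +
        "MERGE (w)-[:WORD_COUNT]->(c)",
      Values.parameters("value", value.getString(0), "count", Long.box(value.getLong(1))))
  }

  def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (driver != null) driver.close()
  }
}

Is that the idiomatic way to do this, or is there some way to reuse the connector from inside a ForeachWriter?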