
I am trying to stream data from a file to Elasticsearch using Akka Streams and elastic4s.

I have a Movie object that can be indexed into Elasticsearch, and I am able to index objects of this type using the HttpClient:

import com.sksamuel.elastic4s.ElasticsearchClientUri
import com.sksamuel.elastic4s.http.HttpClient
import com.sksamuel.elastic4s.http.ElasticDsl._
import scala.concurrent.Await
import scala.concurrent.duration._

val httpClient = HttpClient(ElasticsearchClientUri("localhost", 9200))
val response = Await.result(httpClient.execute {
  indexInto("movies" / "movie").source(movie)
}, 10.seconds)
println(s"result: $response")
httpClient.close()
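
(For context, Movie is just a case class; its exact fields don't matter for the problem. A minimal illustrative version, together with the Indexable[Movie] that .source(movie) needs in scope, might look like the following; the field names and the hand-rolled JSON are mine, not the real class:)

import com.sksamuel.elastic4s.Indexable

case class Movie(id: Int, title: String)

object Movie {
  // .source(movie) resolves an implicit Indexable[Movie];
  // a hand-rolled JSON serializer is enough for this sketch.
  implicit val movieIndexable: Indexable[Movie] = new Indexable[Movie] {
    override def json(movie: Movie): String =
      s"""{ "id": ${movie.id}, "title": "${movie.title}" }"""
  }
}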

Now I am trying to use Akka Streams to index Movie objects.

I have a function to create the sink:

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.Sink
import com.sksamuel.elastic4s.streams.ReactiveElastic._ // provides client.subscriber
import com.sksamuel.elastic4s.streams.RequestBuilder

def toElasticSearch(client: HttpClient)(implicit actorSystem: ActorSystem): Sink[Movie, NotUsed] = {
  var count = 0 // debug counter, incremented as elements reach the sink

  implicit val movieImporter = new RequestBuilder[Movie] {
    import com.sksamuel.elastic4s.http.ElasticDsl._
    def request(movie: Movie): BulkCompatibleDefinition = {
      count = count + 1
      println(s"inserting ${movie.id} -> ${movie.title} - $count")
      index("movies", "movie").source[Movie](movie)
    }
  }

  val subscriber = client.subscriber[Movie](
    batchSize = 10,
    concurrentRequests = 2,
    completionFn = () => println("completion: all done"),
    errorFn = (t: Throwable) => println(s"error: $t")
  )
  Sink.fromSubscriber(subscriber)
}
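
For what it's worth, the sink can also be driven without the GraphDSL boilerplate; a minimal sketch (the movies list and the two-field Movie are the illustrative ones from above):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source

implicit val system: ActorSystem = ActorSystem("movies")
implicit val materializer: ActorMaterializer = ActorMaterializer()

val movies = List(Movie(1, "Alien"), Movie(2, "Blade Runner"))
Source(movies).runWith(toElasticSearch(httpClient))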

and a test:

describe("a DataSinkService elasticsearch sink") {
  it ("should write data to elasticsearch using an http client") {
    var count = 0
    val httpClient = HttpClient(ElasticsearchClientUri("localhost", 9200))
    val graph = GraphDSL.create(sinkService.toElasticSearch(httpClient)) { implicit builder: GraphDSL.Builder[NotUsed] => s =>
      import GraphDSL.Implicits._ // brings ~> into scope
      val flow: Flow[JsValue, Movie, NotUsed] = Flow[JsValue].map[Movie] { j =>
        val m = Movie.fromMovieDbJson(j)
        count = count + 1
        println(s"parsed id:${m.id} - $count")
        m
      }
      sourceService.fromFile(3, 50) ~> flow ~> s
      ClosedShape
    }
    RunnableGraph.fromGraph(graph).run
    Thread.sleep(20.seconds.toMillis)
    println(s"\n*******************\ndone waiting...\n")
    httpClient.close()
    println(s"closed")
  }
}

I send 47 elements via sourceService.fromFile(3, 50). The output shows:

  1. 20 elements processed (parsed in the flow and indexed in the sink)
  2. done waiting
  3. closed
  4. completion: all done (the completionFn)

If I change the subscriber's batchSize and concurrentRequests parameters to 12 and 3 respectively, I see 36 elements parsed and indexed.

So it appears that the sink stops accepting elements after batchSize * concurrentRequests elements.

My questions are:

  1. Does the elastic4s streaming solution work when using an HttpClient?
  2. What am I missing?
  • My first tip would be to stop using `Thread.sleep()` but rather use `scala.concurrent.Await` directly, or whatever the test framework provides to handle Futures. – Frederic A. Aug 11 '17 at 04:16
  • I changed the code to no longer use `Thread.sleep()` but the sink still quits (throws `java.net.ConnectException: Connection refused`) after `batchSize * concurrentRequests` – Doug Anderson Aug 11 '17 at 17:58
  • So the connection to elasticsearch fails. Shouldn't the port be 9300? – Frederic A. Aug 12 '17 at 02:35
  • Actually the connection to elasticsearch is good. Since I am using the http client, the port should be 9200. I am able to write the `batchSize * concurrentRequests` but the stream stops after that many writes to elasticsearch, even though there are more elements that the source can emit – Doug Anderson Aug 14 '17 at 14:04
  • Over the weekend, I gave up on using the stream implementation provided by elastic4s and wrote an actor-based sink that buffers up to some batch size of elements and then makes a bulk call using the http client provided by elastic4s. I still need to firm up the error handling but it appears to fill the buffer, index them all and then receive more elements from the stream. It was able to repeat this process until the source had no more elements to emit. Sample code is available on [github](https://github.com/andersondk7/elasticsearch) – Doug Anderson Aug 14 '17 at 14:06
  • I've created an issue here and I'll look into this as part of the 6.0 release train. https://github.com/sksamuel/elastic4s/issues/1030 – sksamuel Aug 17 '17 at 22:38
