
I am new to Flink and Elasticsearch integration. I have to load historical data (approx. 1 TB) from an old Elasticsearch cluster (5.6) into a new cluster (6.8), with some data filtering and modification during the migration. I am thinking of using a Flink batch job with the Flink Elasticsearch sink.

But since there is no Flink Elasticsearch source connector currently available, what is the best way to get the data into my Flink pipeline? I have a couple of options:

  1. Write a flatmap/process function that queries Elasticsearch and emits the records
  2. Use an open-source third-party library to connect Flink to Elasticsearch. But I don't want to take the risk, because I don't know how these libraries perform.

I am not sure which way is best. Since the data size is huge, I will probably have to parallelize the source operator.
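One way to parallelize the source without a third-party connector is a custom InputFormat for the DataSet API that maps each input split to one Elasticsearch sliced-scroll slice, so parallel subtasks read disjoint portions of the index. The sketch below is untested and makes assumptions: the host, port, and index name are placeholders, and the exact REST client method signatures depend on the client version you pair with your clusters (this assumes a 6.x high-level REST client).

```java
import org.apache.flink.api.common.io.DefaultInputSplitAssigner;
import org.apache.flink.api.common.io.RichInputFormat;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.GenericInputSplit;
import org.apache.flink.core.io.InputSplitAssigner;
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.slice.SliceBuilder;

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: each Flink input split reads exactly one scroll slice.
public class ElasticsearchSliceInputFormat extends RichInputFormat<String, GenericInputSplit> {

    private final String host;   // placeholder, e.g. the old 5.6 cluster
    private final int port;
    private final String index;  // placeholder index name

    private transient RestHighLevelClient client;
    private transient String scrollId;
    private transient Deque<String> buffer;
    private transient boolean exhausted;

    public ElasticsearchSliceInputFormat(String host, int port, String index) {
        this.host = host;
        this.port = port;
        this.index = index;
    }

    @Override
    public void configure(Configuration parameters) {}

    @Override
    public BaseStatistics getStatistics(BaseStatistics cachedStatistics) {
        return cachedStatistics;
    }

    @Override
    public GenericInputSplit[] createInputSplits(int minNumSplits) {
        // One split per scroll slice; split number == slice id.
        GenericInputSplit[] splits = new GenericInputSplit[minNumSplits];
        for (int i = 0; i < minNumSplits; i++) {
            splits[i] = new GenericInputSplit(i, minNumSplits);
        }
        return splits;
    }

    @Override
    public InputSplitAssigner getInputSplitAssigner(GenericInputSplit[] splits) {
        return new DefaultInputSplitAssigner(splits);
    }

    @Override
    public void open(GenericInputSplit split) throws IOException {
        client = new RestHighLevelClient(RestClient.builder(new HttpHost(host, port, "http")));
        buffer = new ArrayDeque<>();
        exhausted = false;

        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.matchAllQuery())
                .size(1000)
                // Disjoint slices: this subtask only ever sees its own slice.
                .slice(new SliceBuilder(split.getSplitNumber(), split.getTotalNumberOfSplits()));

        SearchRequest request = new SearchRequest(index)
                .source(source)
                .scroll(TimeValue.timeValueMinutes(5));

        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        scrollId = response.getScrollId();
        enqueue(response);
    }

    private void enqueue(SearchResponse response) {
        for (SearchHit hit : response.getHits().getHits()) {
            buffer.add(hit.getSourceAsString());
        }
        if (response.getHits().getHits().length == 0) {
            exhausted = true;
        }
    }

    @Override
    public boolean reachedEnd() throws IOException {
        if (buffer.isEmpty() && !exhausted) {
            SearchScrollRequest scroll = new SearchScrollRequest(scrollId)
                    .scroll(TimeValue.timeValueMinutes(5));
            SearchResponse response = client.scroll(scroll, RequestOptions.DEFAULT);
            scrollId = response.getScrollId();
            enqueue(response);
        }
        return buffer.isEmpty();
    }

    @Override
    public String nextRecord(String reuse) {
        return buffer.poll();
    }

    @Override
    public void close() throws IOException {
        if (client != null) {
            client.close();
        }
    }
}
```

Wiring it up would then look like `env.createInput(new ElasticsearchSliceInputFormat("old-es-host", 9200, "my-old-index"))`, with the source parallelism set to the number of slices. Note that Elasticsearch recommends keeping the number of slices no larger than the number of shards of the source index, since higher slice counts add memory overhead on the cluster.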

Please suggest a few options if any of you have come across this scenario. Thanks in advance.

Abhi
  • Flink's batch (DataSet) API can be used with any Hadoop input format. That might be a solution. – David Anderson Sep 10 '20 at 09:38
  • But again, how will I solve the source parallelism? I don't want to read the same data again and again from the Elasticsearch indices if the parallelism > 1, right? Can you provide a bit more detail on how to solve this with the DataSet API? – Abhi Sep 10 '20 at 17:23
  • https://stackoverflow.com/questions/63747019/create-input-format-of-elasticsearch-using-flink-rich-inputformat and https://stackoverflow.com/questions/54329298/elasticsearch-connector-as-source-in-flink may help. And you might use `getRuntimeContext().getIndexOfThisSubtask` if needed to identify each parallel instance. – David Anderson Sep 10 '20 at 17:54
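Either route from the comments (a custom input format, or a parallel source where each instance identifies itself via `getRuntimeContext().getIndexOfThisSubtask()`) can rely on Elasticsearch's sliced scroll to keep parallel readers from fetching the same documents. Each subtask would issue a scroll request of roughly this shape (index name and sizes are placeholders):

```json
POST /my-old-index/_search?scroll=5m
{
  "slice": { "id": 2, "max": 8 },
  "size": 1000,
  "query": { "match_all": {} }
}
```

Here `id` would be the subtask index and `max` the source parallelism; Elasticsearch guarantees the `max` slices are disjoint and together cover the whole index.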

0 Answers