I wonder what is the best way to read data from csv file (located on S3) and then insert into database table.
I have deployed apache flink on my k8s cluster.
I have tried with DataSet
api in the following way:
Source(Read csv) -> Map(Transform POJO to Row) -> Sink(JdbcOutputFormat)
It seems that Sink (writing into DB) is the bottleneck. Source
and Map
tasks are idle for ~80% while at the same time Sink
is idle for 0ms/1s with input rate rate 1.6MB/s.
I can only speed up the whole operation of inserting csv content into my database by spliting the whole operation on new replicas of task managers.
Is there any room for improving performance of my jdbc sink?
[edit]
DataSource<Order> orders = env.readCsvFile("path/to/file") //
.pojoType(Order.class, pojoFields)
.setParallelism(6) //
.name("Read csv"); //
JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
.setQuery("INSERT INTO orders(...) values (...)") //
.setBatchInterval(10000) //
.finish();
orders.map(order -> {
Row r = new Row(29);
//assign values from Order pojo to Row
return r;
}).output(jdbcOutput).name("Postgre SQL Output");
I have experimented with batch interval in range 100-50000 but it didn't affect speed of processing significantly, it's still 1.4-1.6MB/s
If instead of writing to external database I print all entries from csv file to stdout (print())
I get rate 6-7MB/s so this is why I assumed the problem is with jdbc sink.
With this post just wanted to make sure my code doesn't have any performance issues and I reach max performance from a single Task Manager.