
I am wondering what the best way is to read data from a CSV file (located on S3) and then insert it into a database table.

I have deployed Apache Flink on my k8s cluster.

I have tried the DataSet API in the following way:

Source(Read csv) -> Map(Transform POJO to Row) -> Sink(JdbcOutputFormat)

It seems that the Sink (writing into the DB) is the bottleneck. The Source and Map tasks are idle ~80% of the time, while the Sink is idle for 0 ms/1 s with an input rate of 1.6 MB/s.

So far I can only speed up the overall operation of inserting the CSV content into my database by splitting the work across additional task manager replicas.

Is there any room for improving the performance of my JDBC sink?

[edit]

// Read the CSV file into Order POJOs with 6 parallel source subtasks
DataSource<Order> orders = env.readCsvFile("path/to/file")
                .pojoType(Order.class, pojoFields)
                .setParallelism(6)
                .name("Read csv");

// JDBC output format; buffered records are flushed to the database every 10,000 rows
JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
                .setQuery("INSERT INTO orders(...) values (...)")
                .setBatchInterval(10000)
                .finish();

// Convert each Order POJO into a 29-field Row and write it through the JDBC sink
orders.map(order -> {
            Row r = new Row(29);
            // assign values from Order pojo to Row
            return r;
       }).output(jdbcOutput).name("Postgre SQL Output");
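
For completeness, below is a minimal sketch of how the builder is completed with connection settings; the driver class, URL, and credentials are placeholders rather than my exact configuration, and the reWriteBatchedInserts URL flag is just one option I have seen suggested for letting the PostgreSQL driver rewrite JDBC batches into multi-row INSERT statements (I have not verified its effect here):

// Sketch only: connection settings for the JDBC output format.
// Driver class, URL, and credentials are placeholders for my setup.
JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
        .setDrivername("org.postgresql.Driver")
        // reWriteBatchedInserts=true lets the PostgreSQL driver rewrite a JDBC
        // batch into multi-row INSERT statements (not benchmarked in my setup)
        .setDBUrl("jdbc:postgresql://db-host:5432/mydb?reWriteBatchedInserts=true")
        .setUsername("user")
        .setPassword("password")
        .setQuery("INSERT INTO orders(...) values (...)")
        .setBatchInterval(10000)
        .finish();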

I have experimented with the batch interval in the range 100-50,000, but it didn't affect the processing speed significantly; it's still 1.4-1.6 MB/s.

If, instead of writing to the external database, I print all entries from the CSV file to stdout (print()), I get a rate of 6-7 MB/s, which is why I assumed the problem is with the JDBC sink.
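
One variant I have not benchmarked yet is raising the parallelism of the map/sink stage so that several subtasks each hold their own JDBC connection and insert concurrently. A sketch only; the value 12 is an arbitrary example, not a recommendation:

// Sketch: give the map + JDBC sink a higher parallelism so that several
// subtasks write to the database concurrently (12 is an arbitrary example).
orders.map(order -> {
            Row r = new Row(29);
            // assign values from Order pojo to Row
            return r;
       })
       .setParallelism(12)
       .output(jdbcOutput)
       .name("Postgre SQL Output")
       .setParallelism(12);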

With this post I just want to make sure my code doesn't have any performance issues and that I am getting the maximum performance out of a single Task Manager.

Ardelia Lortz
  • Perhaps the answer to https://stackoverflow.com/questions/66487033/flink-jdbc-sink will help you. – David Anderson Mar 16 '21 at 18:08
  • @DavidAnderson I have experimented with the batch interval in the range 100-50,000 (see the edit above), but it didn't change the throughput significantly, so the problem is probably in the database system (a limit per transaction?). I haven't tried the Table API yet. – Ardelia Lortz Mar 16 '21 at 21:57
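
For reference, a minimal sketch of what the Table API route mentioned in the comment above might look like, assuming the filesystem CSV source and the JDBC SQL connector are available; the column lists, paths, table name, and credentials are placeholders, and I have not run or benchmarked this:

// Sketch only: the same pipeline expressed with the Table API / SQL connectors.
// Column lists, paths, table name, and credentials are placeholders.
EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
TableEnvironment tEnv = TableEnvironment.create(settings);

// CSV source table backed by the filesystem connector
tEnv.executeSql(
    "CREATE TABLE orders_csv (...) WITH (" +
    " 'connector' = 'filesystem'," +
    " 'path' = 's3://bucket/path/to/file'," +
    " 'format' = 'csv')");

// JDBC sink table; sink.buffer-flush.max-rows plays the role of the batch interval
tEnv.executeSql(
    "CREATE TABLE orders_db (...) WITH (" +
    " 'connector' = 'jdbc'," +
    " 'url' = 'jdbc:postgresql://db-host:5432/mydb'," +
    " 'table-name' = 'orders'," +
    " 'username' = 'user'," +
    " 'password' = 'password'," +
    " 'sink.buffer-flush.max-rows' = '10000')");

// Copy everything from the CSV table into the database table
tEnv.executeSql("INSERT INTO orders_db SELECT * FROM orders_csv");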

0 Answers