We are using the Flink 1.9.0 DataSet API to read CSV files from an Amazon S3 bucket and are hitting connection pool timeouts most of the time. The setup at the Flink level is as follows:

  1. We read 19708 objects from S3 in a single go, because the logic has to be applied to the whole data set. As an example, imagine 20 source folders (AAA, BBB, CCC, ...), each with multiple date subfolders (AAA/4May2020/../../1.csv, AAA/4May2020/../../2.csv, AAA/3May2020/../../1.csv, AAA/3May2020/../../2.csv, ...). Before calling readCSV, the logic scans the folders, picks only the one with the latest date, and passes that path to the read. The read operation uses a parallelism of 5, but when the execution graph is formed all 20 sources start together (see the sketch after this list).

  2. The job runs on Kubernetes on AWS (kube-aws) with around 10 Task Managers on m5.4xlarge machines. Each Task Manager container is allocated 8 cores and 50 GB of memory.
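For reference, here is a minimal sketch (not the actual job) of how such a read could look with the DataSet API. The bucket name, column types, and hard-coded prefixes are placeholders; in the real job the latest-date prefixes come from the folder-scanning logic described in point 1, and each readCsvFile call becomes its own source in the execution graph:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.Arrays;
import java.util.List;

public class LatestFolderCsvRead {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical prefixes; the real job resolves these by scanning S3
        // and keeping only the latest date folder per source.
        List<String> latestPrefixes = Arrays.asList(
                "s3a://my-bucket/AAA/4May2020/",
                "s3a://my-bucket/BBB/4May2020/",
                "s3a://my-bucket/CCC/4May2020/");

        DataSet<Tuple2<String, String>> all = null;
        for (String prefix : latestPrefixes) {
            // Each readCsvFile becomes a separate source in the execution graph,
            // so with 20 prefixes all 20 sources are scheduled together.
            DataSet<Tuple2<String, String>> one = env
                    .readCsvFile(prefix)
                    .types(String.class, String.class) // placeholder column types
                    .setParallelism(5);                // read parallelism of 5 per source
            all = (all == null) ? one : all.union(one);
        }

        // ... the actual business logic on the combined data set goes here ...
        all.first(10).print(); // print() triggers execution of the plan
    }
}
```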

The following has been tried to address the issue, but with no luck so far. We really need some pointers and help here:

  • Enabled the Flink retry mechanism with the failover strategy set to "region". With retries the job sometimes gets through, but it still fails intermittently.
  • Revisited core-site.xml as per the AWS/Hadoop S3A documentation and set fs.s3a.threads.max: 3000 and fs.s3a.connection.maximum: 4500 (see the excerpt below).
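For completeness, this is roughly how those two entries look in core-site.xml; only the properties quoted above are shown and the rest of the file is omitted:

```xml
<configuration>
  <!-- Maximum number of threads used by the S3A transfer manager -->
  <property>
    <name>fs.s3a.threads.max</name>
    <value>3000</value>
  </property>
  <!-- Maximum number of simultaneous HTTP connections to S3 -->
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>4500</value>
  </property>
</configuration>
```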
Also, could anyone help with the following questions?

  • Is there any way to check whether the HTTP connections opened by readCSV are closed?

  • Any pointers on how the DataSet readCSV operates would also help.
  • Is there any way to introduce a wait mechanism before the read?
  • Is there a better way to address this issue?
  • Could you check if this [thread](https://stackoverflow.com/questions/56695660/unable-to-execute-http-request-timeout-waiting-for-connection-from-pool-in-flin) helps you? – Arvid Heise May 07 '20 at 07:17
