
For my requirement, I need to join data in a PostgreSQL database (hosted on RDS) with a file in an S3 bucket. I have created a Glue job (Spark/Scala) that should connect to both PostgreSQL and the S3 bucket and complete the processing.

But the Glue job hits a connection timeout when connecting to S3 (error message below). It fetches data from PostgreSQL successfully.

There is no permission-related issue with S3, because I am able to read from and write to the same S3 bucket/path from a different job. The exception occurs only when I try to connect to both PostgreSQL and S3 in a single Glue job/script.

In the Glue job, the GlueContext is created from a SparkContext object. I also tried creating two separate SparkSessions, one for S3 and one for the PostgreSQL database, but that approach didn't work either; the same timeout occurred.
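For reference, the job layout described above looks roughly like this. This is a minimal sketch only: the JDBC URL, table name, S3 path, and join keys are placeholders, not the actual job.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the setup described above.
// All names (endpoint, db, table, bucket path, columns) are placeholders.
object JoinRdsWithS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-rds-s3").getOrCreate()

    // Read from PostgreSQL on RDS -- this part succeeds.
    val pgDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://<rds-endpoint>:5432/<db>")
      .option("dbtable", "<table>")
      .option("user", "<user>")
      .option("password", "<password>")
      .load()

    // Read the file from S3 -- this is where the connect timeout occurs.
    val s3Df = spark.read.option("header", "true").csv("s3://emp_bucket/<path>/")

    // Join and continue processing.
    val joined = pgDf.join(s3Df, Seq("<join_column>"))
    joined.show()
  }
}
```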

Please help me resolve the issue.

Error/Exception from the log:

ERROR [main] glue.processLauncher (Logging.scala:logError(91)): Exception in User Class com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to emp_bucket.s3.amazonaws.com:443 [emp_bucket.s3.amazonaws.com/] failed: connect timed out

Swapnil

1 Answer


This is fixed now.

The issue was with the security group: earlier, only limited TCP traffic was allowed. As part of the fix, traffic was opened for all, and an HTTPS rule was added to the inbound rules as well.
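Assuming the security group attached to the Glue connection, the HTTPS inbound rule can be added with the AWS CLI like this (the group ID below is a placeholder):

```shell
# Allow inbound HTTPS (TCP 443) on the security group used by the
# Glue connection; sg-0123456789abcdef0 is a placeholder group ID.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 443 \
    --cidr 0.0.0.0/0
```

Note that opening traffic to 0.0.0.0/0 is the broadest possible rule; you may want to restrict the CIDR to what your job actually needs.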

Swapnil