
I am trying to connect to services and databases running inside a VPC (private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., moving to a public subnet or setting up public load balancers).

Unfortunately, AWS Glue doesn't seem to support running inside user-defined VPCs. AWS does provide something called Glue Database Connections which, when used with the Glue SDK, magically set up elastic network interfaces inside the specified VPC for the Glue/Spark worker nodes. The network interfaces then tunnel traffic from Glue to a specific database inside the VPC. However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.

Is there a reliable way to set up a Glue -> VPC connection that will tunnel all traffic through a VPC?

Turiphro
  • Isn't there a VPC [interface endpoint for Glue](https://docs.aws.amazon.com/glue/latest/dg/vpc-endpoint.html)? – Marcin May 01 '20 at 10:54
  • @Marcin VPC endpoints only allow resources running inside the VPC, in private subnets without Internet access, to call the AWS Glue API. That allows network connections originating from inside the VPC to access Glue; this question is about allowing connections originating inside Glue to access the VPC. – Mark B May 01 '20 at 13:10
  • @MarkB I see. Thanks. I misunderstood the question. – Marcin May 01 '20 at 13:15
  • Yep, a database connection is the only way to do it, and it doesn't have to be valid. Check this [CloudFormation sample](https://github.com/aws-samples/amazon-redshift-commands-using-aws-glue/blob/master/RedshiftCommands.yaml#L279-L283) published by AWS that runs a Python shell job, which simply creates a bogus connection, but with correct subnet details. – Farid Nouri Neshat Jul 07 '20 at 18:22
  • This temporary approach worked for me. I have MongoDB 4.x in my VPC. I created a Glue connection for it in the GUI. The test connection failed (AWS is troubleshooting) but my VPC settings are correct. I created a new job with the "Catalog options" > "Use Glue data catalog as the Hive metastore" option checked. Next, I chose the Glue connection I had just set up. This allowed me to connect to MongoDB from within my script using: elasticsearch-spark-20_2.11-7.10.1.jar. In theory, when the Glue connection is fixed, the job will be rewritten to use it. But this workaround is running successfully. – Matthew Dec 11 '20 at 17:21

2 Answers


You can create a database connection with NETWORK connection type and use that connection in your Glue job. It will allow your job to call a REST API or any other resource within your VPC.
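As a rough sketch of what that could look like outside the console, the snippet below builds the payload for a NETWORK connection and registers it through boto3's `glue.create_connection`. The function names and every ID passed to them are hypothetical placeholders, and the empty `ConnectionProperties` reflects the assumption that a NETWORK connection needs no JDBC URL or credentials.

```python
def network_connection_input(name, subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput payload for a Glue connection of type NETWORK."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # Assumption: NETWORK connections carry no JDBC URL or credentials.
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(connection_input, region_name):
    """Register the connection with Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.create_connection(ConnectionInput=connection_input)
```

The connection is then attached to the job, for example by passing `Connections={"Connections": ["my-vpc-network"]}` to `glue.create_job`, where `my-vpc-network` stands for whatever name was used above.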


https://docs.aws.amazon.com/glue/latest/dg/connection-using.html

Network (designates a connection to a data source within an Amazon Virtual Private Cloud environment (Amazon VPC))


https://docs.aws.amazon.com/glue/latest/dg/connection-JDBC-VPC.html

To allow AWS Glue to communicate with its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC and not open it to all networks.
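The self-referencing rule described above might be added with boto3 roughly as follows. This is only a sketch: the helper names are made up for illustration, and the security group ID is whatever group you attach to the Glue connection.

```python
def self_referencing_tcp_rule(security_group_id):
    """Build an ingress rule allowing all TCP ports from the group itself."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        # The source is the same security group, not a CIDR range.
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }


def allow_self_ingress(security_group_id, region_name):
    """Apply the rule to the group (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[self_referencing_tcp_rule(security_group_id)],
    )
```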


Oleksandr Lykhonosov

However, this requires the location and credentials of specific databases, and it is not clear if and when other traffic (e.g., a REST call to a service) is tunnelled through the VPC.

I agree the documentation is confusing, but according to this paragraph on the page you linked, it appears that all traffic is indeed tunneled through the VPC, since you have to have a NAT Gateway or VPC endpoints to allow Glue to access things outside the VPC once you have configured it with VPC access:

All JDBC data stores that are accessed by the job must be available from the VPC subnet. To access Amazon S3 from within your VPC, a VPC endpoint is required. If your job needs to access both VPC resources and the public internet, the VPC needs to have a Network Address Translation (NAT) gateway inside the VPC.
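For the S3 requirement in that paragraph, a gateway endpoint could be created along these lines. Treat it as a sketch under assumptions: the helper names are hypothetical, and the VPC ID, region, and route table IDs must be those of the subnet the Glue connection uses.

```python
def s3_gateway_endpoint_params(vpc_id, region_name, route_table_ids):
    """Build the arguments for an S3 gateway VPC endpoint."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region_name}.s3",
        # Route tables of the subnets the Glue job's network interfaces live in.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, region_name, route_table_ids):
    """Create the endpoint (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    params = s3_gateway_endpoint_params(vpc_id, region_name, route_table_ids)
    return ec2.create_vpc_endpoint(**params)
```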

Mark B
  • The trouble is that you can't "configure Glue with VPC access"; this link is not configured anywhere. The link only appears implicitly if you use a predefined Connection. You can try a bogus connection and ignore exceptions, but that doesn't sound like solid engineering. – Turiphro May 01 '20 at 13:52
  • Yes I agree, but if you don't already have a data source you want to connect to in the VPC, why do you want the Glue job to run in the VPC? – Mark B May 01 '20 at 14:38
  • The goal is to access internal REST services (Fargate). – Turiphro May 01 '20 at 14:45
  • There's also a database that's queried in some cases; the SDK only allows data sinks (not SQL queries) and the credentials rotate (so should not be fixed during "Glue Connection" creation). It would help to have the VPC connection separated from their internal database sink SDK. The two seem like different features to me. – Turiphro May 01 '20 at 14:53
  • I agree, the features should be separate. Given the current Glue API it looks like you are stuck doing a fake JDBC connection I guess, but like you said that is far from ideal. All you can do is submit a feature request to AWS. – Mark B May 01 '20 at 15:42