3

For a current ETL job, I am trying to create a Python Shell Job in Glue. The transformed data needs to be persisted in DocumentDB, but I am unable to access DocumentDB from Glue.

Since the DocumentDB cluster resides in a VPC, I thought of creating an interface VPC endpoint to access DocumentDB from Glue, but DocumentDB is not one of the supported services for interface endpoints. I see tunneling as a suggested option, but I do not want to do that.

So I want to know: is there a way to connect to DocumentDB from Glue?

Rahul
  • Can't Glue [create ENIs to access resources](https://docs.aws.amazon.com/glue/latest/dg/start-connecting.html) in a VPC? – Marcin May 18 '20 at 10:49

3 Answers

2

Create a dummy JDBC connection in AWS Glue. You will not need to test the connection, but it will allow ENIs to be created in the VPC. Attach this connection to your Python shell job; this will allow you to interact with your resources.
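For reference, a minimal boto3 sketch of this setup; the connection name, JDBC URL, subnet, security group, role, and script location below are placeholder values you would replace with your own:

import boto3

glue = boto3.client('glue')

# Dummy JDBC connection: the URL is never actually used, it only forces Glue
# to create ENIs in the chosen subnet/security group so the job can reach DocumentDB.
glue.create_connection(
    ConnectionInput={
        'Name': 'docdb-vpc-dummy-connection',           # placeholder name
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:mysql://dummy:3306/dummy',
            'USERNAME': 'dummy',
            'PASSWORD': 'dummy',
        },
        'PhysicalConnectionRequirements': {
            'SubnetId': 'subnet-0123456789abcdef0',      # subnet that can reach DocumentDB
            'SecurityGroupIdList': ['sg-0123456789abcdef0'],
            'AvailabilityZone': 'us-east-1a',
        },
    }
)

# Attach the connection to the Python shell job so its ENIs land in the VPC.
glue.create_job(
    Name='docdb-etl-python-shell',                       # placeholder name
    Role='arn:aws:iam::123456789012:role/GlueJobRole',   # placeholder role
    Command={
        'Name': 'pythonshell',
        'ScriptLocation': 's3://your-bucket/scripts/etl.py',
        'PythonVersion': '3',
    },
    Connections={'Connections': ['docdb-vpc-dummy-connection']},
)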

Eman
  • I did try this. But the issue I am facing is that once I set up a (MongoDB) connection to the database, I am not able to install the additional libraries. In my case I use PyMongo, and when I add the connection the .whl file does not pull the library from the internet, but when I do not explicitly add the connection, the library is pulled in automatically when I run the job. – Rahul May 18 '20 at 14:49
  • One thing you should be aware of when working with connections in Glue: the elastic network interface is assigned a private IP address from the IP address range of the subnet you specified, and no public IP addresses are assigned. Set up a NAT gateway and it should allow the job to connect to the internet from a private subnet: https://aws.amazon.com/premiumsupport/knowledge-center/nat-gateway-vpc-private-subnet/ – Eman May 18 '20 at 22:44
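Following up on the NAT gateway suggestion in the last comment, a rough boto3 sketch of that setup; the public subnet, private route table, and all resource IDs below are placeholders, not values from the thread:

import boto3

ec2 = boto3.client('ec2')

# Allocate an Elastic IP and create a NAT gateway in a *public* subnet.
eip = ec2.allocate_address(Domain='vpc')
nat = ec2.create_nat_gateway(
    SubnetId='subnet-public-0123456789abcdef0',   # placeholder public subnet
    AllocationId=eip['AllocationId'],
)
nat_id = nat['NatGateway']['NatGatewayId']

# Wait until the NAT gateway is available before adding routes.
ec2.get_waiter('nat_gateway_available').wait(NatGatewayIds=[nat_id])

# Route internet-bound traffic from the private subnet (where the Glue ENIs live)
# through the NAT gateway so the job can download wheels such as PyMongo.
ec2.create_route(
    RouteTableId='rtb-0123456789abcdef0',          # placeholder private route table
    DestinationCidrBlock='0.0.0.0/0',
    NatGatewayId=nat_id,
)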
1

Have you tried using the MongoDB connection type in Glue connections? You can connect to DocumentDB through that option.
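A rough boto3 sketch of what that could look like; this assumes the MONGODB connection type with CONNECTION_URL / USERNAME / PASSWORD connection properties, and the endpoint, credentials, subnet, and security group are placeholders:

import boto3

glue = boto3.client('glue')

# MongoDB-type Glue connection pointing at the DocumentDB cluster endpoint.
glue.create_connection(
    ConnectionInput={
        'Name': 'docdb-mongodb-connection',   # placeholder name
        'ConnectionType': 'MONGODB',
        'ConnectionProperties': {
            'CONNECTION_URL': 'mongodb://yourdocumentdbcluster.amazonaws.com:27017/yourdbname',
            'USERNAME': 'yourusername',
            'PASSWORD': 'yourpassword',
        },
        'PhysicalConnectionRequirements': {
            'SubnetId': 'subnet-0123456789abcdef0',
            'SecurityGroupIdList': ['sg-0123456789abcdef0'],
            'AvailabilityZone': 'us-east-1a',
        },
    }
)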

Shubham Jain
1

I have been able to connect to DocumentDB from Glue and ingest data from a CSV in S3; here is the script to do that:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Constants
data_catalog_database = 'sample-db'
data_catalog_table = 'data'

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

spark_context = SparkContext()
glue_context = GlueContext(spark_context)
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read from data source
## @type: DataSource
## @args: [database = "glue-gzip", table_name = "glue_gzip"]
## @return: dynamic_frame
## @inputs: []
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database=data_catalog_database,
    table_name=data_catalog_table
)

documentdb_write_uri = 'mongodb://yourdocumentdbcluster.amazonaws.com:27017'
write_documentdb_options = {
    "uri": documentdb_write_uri,
    "database": "yourdbname",
    "collection": "yourcollectionname",
    "username": "###",
    "password": "###"
}

# Write the DynamicFrame to DocumentDB
glue_context.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="documentdb",
    connection_options=write_documentdb_options
)

job.commit()

In summary:

  1. Create a crawler that crawls your data (e.g. the CSV stored in an S3 bucket), infers its schema, and creates a table in the Glue Data Catalog (a sketch of this step follows the list).
  2. Use that database and table to read the data and write it into your DocumentDB.
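For step 1, a minimal boto3 sketch of the crawler setup; the bucket path, role ARN, and crawler name are placeholder values, and the database name matches the sample-db constant used in the script above:

import boto3

glue = boto3.client('glue')

# Crawl the CSV files in S3 and register the inferred schema as a table
# in the 'sample-db' Data Catalog database read by the job above.
glue.create_crawler(
    Name='sample-csv-crawler',                              # placeholder name
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',  # placeholder role
    DatabaseName='sample-db',
    Targets={'S3Targets': [{'Path': 's3://your-bucket/data/'}]},
)
glue.start_crawler(Name='sample-csv-crawler')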
gab
  • 1
    I created a crawler that created the schema of my data and a table with source as MongoDB in the Glue catalog. When I tried to create a dynamic frame from the catalog, it threw the following error: IllegalArgumentException: missing database name. Set via spark.mongo.input.uri... – user8866279 May 03 '21 at 05:42