2

I am trying to establish a connection from AWS Glue to a remote server via SFTP using Python 3.7. I tried using the pysftp library for this task.

However, pysftp pulls in (via paramiko) a library named bcrypt, which contains both Python and C code. At the moment, AWS Glue supports only pure-Python libraries, as mentioned in the documentation linked below.

https://docs.aws.amazon.com/glue/latest/dg/console-custom-created.html

The error I am getting is shown below.

ImportError: cannot import name '_bcrypt'

I am stuck here because the compiled bcrypt extension cannot be imported.

So I tried the JSch Java library from Scala instead. That compiles successfully, but I get the following exception.

com.jcraft.jsch.JSchException: java.net.UnknownHostException: [Remote Server Hostname]

How can we connect to a remote server via SFTP from AWS Glue? Is it possible?

How can we configure outbound rules (if required) for a Glue job?

Amlan Alok

2 Answers

6

I am answering my own question here for anyone whom this might help.

The straight answer is no.

I found the below resources which indicate that AWS Glue is an ETL tool for AWS resources.

AWS Glue uses other AWS services to orchestrate your ETL (extract, transform, and load) jobs to build a data warehouse.

Source - https://docs.aws.amazon.com/glue/latest/dg/how-it-works.html

Glue works well only with ETL from JDBC and S3 (CSV) data sources. In case you are looking to load data from other cloud applications, File Storage Base, etc. Glue would not be able to support.

Source - https://hevodata.com/blog/aws-glue-etl/

Hence, to implement what I was working on, I used an AWS Lambda function to connect to the remote server via SFTP, pick up the required files, and drop them in an S3 bucket. The AWS Glue job can then read the files from S3.
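
A minimal sketch of what such a Lambda function might look like, not a definitive implementation. It assumes pysftp is supplied through a Lambda layer and boto3 by the runtime; the environment variable names (SFTP_HOST, SFTP_USER, SFTP_PASSWORD, SFTP_REMOTE_DIR, S3_BUCKET, S3_PREFIX, KNOWN_HOSTS_PATH) and the helper name copy_sftp_dir_to_s3 are hypothetical. Files are buffered in memory, so this only suits files that fit comfortably in the function's memory.

```python
import io
import os
import posixpath


def copy_sftp_dir_to_s3(sftp, s3, remote_dir, bucket, prefix=""):
    """Copy every regular file in remote_dir to the bucket and return the S3 keys."""
    uploaded = []
    for name in sftp.listdir(remote_dir):
        remote_path = posixpath.join(remote_dir, name)
        if not sftp.isfile(remote_path):
            continue  # skip subdirectories and other non-regular entries
        buffer = io.BytesIO()
        sftp.getfo(remote_path, buffer)  # download the remote file into memory
        buffer.seek(0)
        key = posixpath.join(prefix, name) if prefix else name
        s3.upload_fileobj(buffer, bucket, key)
        uploaded.append(key)
    return uploaded


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised without these packages.
    import boto3
    import pysftp

    # Assumption: a known_hosts file containing the server's host key ships with
    # the function; all environment variable names below are hypothetical.
    cnopts = pysftp.CnOpts(knownhosts=os.environ["KNOWN_HOSTS_PATH"])
    with pysftp.Connection(
        os.environ["SFTP_HOST"],
        username=os.environ["SFTP_USER"],
        password=os.environ["SFTP_PASSWORD"],
        cnopts=cnopts,
    ) as sftp:
        keys = copy_sftp_dir_to_s3(
            sftp,
            boto3.client("s3"),
            os.environ["SFTP_REMOTE_DIR"],
            os.environ["S3_BUCKET"],
            os.environ.get("S3_PREFIX", ""),
        )
    return {"uploaded": keys}
```

Keeping the copy logic in a function that takes the SFTP and S3 clients as arguments makes it easy to test with stand-in objects before deploying.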

Amlan Alok
  • What would be the trigger for the Lambda here, please? Also, could you please post sample code, as I am stuck on the same bcrypt error? – Yuva Apr 30 '21 at 14:01
  • 2
    Hi @Yuva, I had scheduled the Lambda trigger from CloudWatch at a specific UTC time. I do not have the code at the moment. From recollection, I used Python 3.7 for the Lambda function and the pysftp library for the SFTP connection. I was able to easily find some code examples of this library on Google. The pysftp library was added as a layer in the Lambda function. – Amlan Alok May 01 '21 at 18:29
  • I am trying to build the pysftp library, but I am running into dependency issues with cffi, bcrypt, etc. I am debugging them. Thanks for your reply. – Yuva May 02 '21 at 07:21
  • 1
    I used an Ubuntu VM in EC2 to create the zip file containing the pysftp library for the Lambda layer. I followed this video. It is for Pandas, but you can follow the same steps for pysftp - https://www.youtube.com/watch?v=zrrH9nbSPhQ&list=PLB0ozhyre8fd-F7hWHJIavjuGhjz1FJQd&index=8&t=1086s – Amlan Alok May 05 '21 at 15:55
  • 1
    Thanks. Yes, I used a Linux EC2 instance to build the pysftp library to run on Lambda, and it is working fine. The only point to keep in mind is that the Python version must be the same on the Lambda function and the EC2 instance. Otherwise, pysftp's dependencies will conflict. – Yuva May 06 '21 at 06:10
-1

I know some time has passed since this question was posted, but I would like to share some tools that can help you get data from an SFTP server more easily and quickly. To build a layer the easy way, use this tool: https://github.com/aws-samples/aws-lambda-layer-builder. With it you can make a pysftp layer faster and without those annoying errors (cffi, bcrypt).

Lambda's temporary storage (/tmp) is limited to 512 MB by default, so if you are trying to pull large files, the function will fail for this reason. To fix this, you can attach EFS (Elastic File System) to your Lambda: https://docs.aws.amazon.com/lambda/latest/dg/services-efs.html
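
As a rough sketch of how a large file could be written to an attached file system rather than held in memory: this assumes the EFS access point is mounted at the hypothetical local path /mnt/sftp and that `sftp` is an already open pysftp.Connection. The function name download_to_mount is made up for illustration.

```python
import os
import posixpath


def download_to_mount(sftp, remote_path, mount_dir="/mnt/sftp"):
    """Download one remote file into mount_dir and return the local path."""
    os.makedirs(mount_dir, exist_ok=True)
    local_path = os.path.join(mount_dir, posixpath.basename(remote_path))
    # pysftp writes the remote file to local_path on disk
    sftp.get(remote_path, local_path)
    return local_path
```

From there the file can be uploaded to S3 with boto3's upload_file, which reads from disk instead of memory.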