
My big data infrastructure consists of Airflow and EMR running in two separate clusters. Currently, the ETL steps are as follows:

  1. Sqoop data onto an Airflow worker (Hadoop 2.7 is installed there in pseudo-distributed mode)
  2. Sync the data to S3
  3. Access the data on S3 using Spark on EMR (EMR is running Hadoop 3.2.1)

In an attempt to streamline the ETL process, I feel that the second step is unnecessary and that it should be possible to load data directly into S3 through Sqoop (the Sqoop command will be executed on the Airflow worker).
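
For illustration, this is roughly the command I have in mind, run from the Airflow worker (the JDBC connection string, table name, credentials file and bucket are placeholders, not my actual job):

    sqoop import \
      --connect jdbc:mysql://source-db:3306/sales \
      --username etl_user \
      --password-file file:///home/airflow/.sqoop-password \
      --table orders \
      --target-dir s3a://my-data-lake/raw/orders \
      -m 4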

But when I set Sqoop's --target-dir parameter to an S3 URL, the Sqoop job crashes with java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3. I have attempted many fixes, but none have been successful so far. The things I have tried are listed below (a sketch of one of these attempts follows the list):

  1. Pointing Sqoop at the Hadoop installation on EMR instead of the local pseudo-distributed Hadoop
  2. Copying possible dependency JARs from EMR into Sqoop's lib directory, such as emrfs-hadoop-assembly, hadoop-common and hadoop-hdfs
  3. Using the different S3 URL schemes: s3, s3a and s3n
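
An illustrative sketch of attempt 3, with the S3A filesystem implementation and credentials passed explicitly as Hadoop properties (the connection details, bucket and keys are placeholders):

    sqoop import \
      -D fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
      -D fs.s3a.access.key=... \
      -D fs.s3a.secret.key=... \
      --connect jdbc:mysql://source-db:3306/sales \
      --table orders \
      --target-dir s3a://my-data-lake/raw/orders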

I'm confident that, to the best of my knowledge, all of the configuration is correct. Is there something I have missed, or is this a Sqoop limitation that prevents loading directly into S3?

Rukshan Hassim

1 Answer


You can resolve it by following the steps here: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
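
In case the link becomes unavailable: the usual cause of "No FileSystem for scheme: s3" on a standalone Hadoop/Sqoop installation is that the S3 connector JARs are not on the classpath. A minimal sketch of that kind of fix (the paths assume a standard Hadoop 2.7 tarball layout, and the connection, table and bucket are placeholders; none of this is taken from the linked article):

    # Copy the S3A connector and its bundled AWS SDK onto Sqoop's classpath;
    # a Hadoop 2.7 binary distribution ships them under share/hadoop/tools/lib.
    cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-*.jar   $SQOOP_HOME/lib/
    cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-*.jar $SQOOP_HOME/lib/

    # Use the s3a:// scheme rather than plain s3://, which on EMR is handled
    # by EMRFS and is not available on an off-cluster Hadoop installation.
    sqoop import \
      --connect jdbc:mysql://source-db:3306/sales \
      --table orders \
      --target-dir s3a://my-data-lake/raw/orders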

Moohebat