
My big data infrastructure consists of Airflow and EMR running in two separate clusters. Currently, the ETL steps are as follows:

  1. Sqoop data onto an Airflow worker (Hadoop 2.7 is installed there in pseudo-distributed mode)
  2. Sync the data to S3
  3. Access the data on S3 using Spark on EMR (EMR is running Hadoop 3.2.1)

In an attempt to streamline the ETL process, I feel that the second step is unnecessary and that it should be possible to load data directly into S3 through Sqoop (the Sqoop command will be executed on the Airflow worker).
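
For illustration, this is roughly the command I have in mind, run from the Airflow worker (the JDBC connection string, table name, credentials file and bucket are placeholders, not my actual job):

    sqoop import \
      --connect jdbc:mysql://source-db:3306/sales \
      --username etl_user \
      --password-file file:///home/airflow/.sqoop-password \
      --table orders \
      --target-dir s3a://my-data-lake/raw/orders \
      -m 4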

But when I set Sqoop's --target-dir parameter to an S3 URL, the Sqoop job crashes with java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3. I have attempted many fixes, but none have been successful so far. The things I have tried are listed below (a sketch of one of these attempts follows the list):

  1. Pointing Sqoop at the Hadoop installation on EMR instead of the local pseudo-distributed Hadoop
  2. Copying possible dependency JARs from EMR into Sqoop's lib directory, such as emrfs-hadoop-assembly, hadoop-common and hadoop-hdfs
  3. Using the different S3 URL schemes: s3, s3a and s3n
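
An illustrative sketch of attempt 3, with the S3A filesystem implementation and credentials passed explicitly as Hadoop properties (the connection details, bucket and keys are placeholders):

    sqoop import \
      -D fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
      -D fs.s3a.access.key=... \
      -D fs.s3a.secret.key=... \
      --connect jdbc:mysql://source-db:3306/sales \
      --table orders \
      --target-dir s3a://my-data-lake/raw/orders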

I'm confident that, to the best of my knowledge, all of the configuration is correct. Is there something I have missed, or is this a Sqoop limitation that prevents loading directly into S3?

Rukshan Hassim

1 Answer


You can resolve it by following the steps here: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
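
In case the link becomes unavailable: the usual cause of "No FileSystem for scheme: s3" on a standalone Hadoop/Sqoop installation is that the S3 connector JARs are not on the classpath. A minimal sketch of that kind of fix (the paths assume a standard Hadoop 2.7 tarball layout, and the connection, table and bucket are placeholders; none of this is taken from the linked article):

    # Copy the S3A connector and its bundled AWS SDK onto Sqoop's classpath;
    # a Hadoop 2.7 binary distribution ships them under share/hadoop/tools/lib.
    cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-*.jar   $SQOOP_HOME/lib/
    cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-*.jar $SQOOP_HOME/lib/

    # Use the s3a:// scheme rather than plain s3://, which on EMR is handled
    # by EMRFS and is not available on an off-cluster Hadoop installation.
    sqoop import \
      --connect jdbc:mysql://source-db:3306/sales \
      --table orders \
      --target-dir s3a://my-data-lake/raw/orders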

Moohebat