First question on Stack Overflow, so please forgive any rookie mistakes.
I am currently moving a large amount of data (700+ GiB), consisting of many small files of about 1-10 MB each, from a folder in a GCS bucket to a folder in an S3 bucket.
Several attempts I made (rough sketches of the invocations follow the list):
- using gsutil -m rsync -r gs://<path> s3://<path>
Results in a timeout due to the large amount of data
- using gsutil -m cp -r gs://<path> s3://<path>
Takes way too long. Even with many parallel processes and/or threads it still transfers at about 3.4 MiB/s on average. I made sure to upgrade the VM instance for this attempt.
- using rclone
Same performance issue as cp
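To be concrete, these are roughly the tuned invocations I mean above; the rclone remote names and the parallelism values are placeholders rather than a verified recipe:
# gsutil with explicit parallelism via boto config overrides (values are illustrative)
gsutil -o "GSUtil:parallel_process_count=16" -o "GSUtil:parallel_thread_count=8" -m rsync -r gs://<path> s3://<path>
# rclone with more concurrent transfers; "gcs" and "s3" are remote names from my rclone config
rclone copy gcs:<bucket>/<path> s3:<bucket>/<path> --transfers 64 --checkers 32 --fast-list --progress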
Recently I found another possible method of doing this. However, I am not familiar with GCP, so please bear with me. This is the reference I found: https://medium.com/swlh/transfer-data-from-gcs-to-s3-using-google-dataproc-with-airflow-aa49dc896dad The method involves creating a Dataproc cluster through the GCP console with the following configuration (a gcloud sketch of what I think is the equivalent command follows the property list):
Name:
<dataproc-cluster-name>
Region:
asia-southeast1
Nodes configuration:
1 main + 2 workers, each with 2 vCPUs, 3.75 GB memory and a 30 GB persistent disk
Properties:
core fs.s3.awsAccessKeyId <key>
core fs.s3.awsSecretAccessKey <secret>
core fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
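For reproducibility, this is my understanding of an equivalent gcloud command for that cluster; the machine type and boot-disk flags are guesses at what the console settings map to, so treat it as a sketch:
# <2-vcpu-machine-type> stands for whatever machine type matches 2 vCPU / 3.75 GB memory; I have not verified the exact name
gcloud dataproc clusters create <dataproc-cluster-name> \
    --region=asia-southeast1 \
    --num-workers=2 \
    --master-machine-type=<2-vcpu-machine-type> \
    --worker-machine-type=<2-vcpu-machine-type> \
    --master-boot-disk-size=30GB \
    --worker-boot-disk-size=30GB \
    --properties='core:fs.s3.awsAccessKeyId=<key>,core:fs.s3.awsSecretAccessKey=<secret>,core:fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem'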
Then I submitted the job through the console menu on the GCP website:
- At this point I started noticing issues: I could not find hadoop-mapreduce/hadoop-distcp.jar anywhere. I could only find /usr/lib/hadoop/hadoop-distcp.jar by browsing the root filesystem on my main Dataproc cluster VM instance.
- The job I submitted (a sketch of what I think is the shell-equivalent command follows the argument list):
Start time: 31 Mar 2021, 16:00:25
Elapsed time: 3 sec
Status: Failed
Region: asia-southeast1
Cluster: <cluster-name>
Job type: Hadoop
Main class or JAR: file://usr/lib/hadoop/hadoop-distcp.jar
Arguments:
-update
gs://*
s3://*
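As far as I can tell, the shell equivalent of that console submission would look roughly like this (my own reconstruction, not a command taken from the reference article; the source and target paths are the same placeholders as before):
gcloud dataproc jobs submit hadoop \
    --cluster=<cluster-name> \
    --region=asia-southeast1 \
    --jar=file:///usr/lib/hadoop/hadoop-distcp.jar \
    -- -update gs://<path> s3://<path>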
Either way, the job returns this error:
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2400: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2365: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2460: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_OPTS: invalid variable name
2021-03-31 09:00:28,549 ERROR tools.DistCp: Invalid arguments:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2638)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3342)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3374)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:126)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3425)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3393)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:240)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:441)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2542)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2636)
... 18 more
Invalid arguments: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB, accepts
bandwidth as a fraction.
-blocksperchunk <arg> If set to a positive value, fileswith more
blocks than this value will be split into
chunks of <blocksperchunk> blocks to be
transferred in parallel, and reassembled on
the destination. By default,
<blocksperchunk> is 0 and the files will be
transmitted in their entirety without
splitting. This switch is only applicable
when the source file system implements
getBlockLocations method and the target
file system implements concat method
-copybuffersize <arg> Size of the copy buffer to use. By default
<copybuffersize> is 8192B.
-delete Delete from target, files missing in
source. Delete is applicable only with
update or overwrite options
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-direct Write files directly to the target
location, avoiding temporary file rename.
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied
to <= n
-filters <arg> The path to a file containing a list of
strings for paths to be excluded from the
copy.
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs
are saved
-m <arg> Max number of concurrent maps to use for
copy
-numListstatusThreads <arg> Number of threads to use for building file
listing (max 40).
-overwrite Choose to overwrite target files
unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If
-p is specified with no <arg>, then
preserves replication, block size, user,
group, permission, checksum type and
timestamps. raw.* xattrs are preserved when
both the source and destination paths are
in the /.reserved/raw hierarchy (HDFS
only). raw.* xattrpreservation is
independent of the -p flag. Refer to the
DistCp documentation for more details.
-rdiff <arg> Use target snapshot diff report to identify
changes made on target
-sizelimit <arg> (Deprecated!) Limit number of files copied
to <= n bytes
-skipcrccheck Whether to skip CRC checks between source
and target paths.
-strategy <arg> Copy strategy to use. Default is dividing
work based on file sizes
-tmp <arg> Intermediate work path to be used for
atomic commit
-update Update target, copying only missing files
or directories
-v Log additional info (path, size) in the
SKIP/COPY log
-xtrack <arg> Save information about missing source files
to the specified directory
How can I fix this problem? The fixes I have found online aren't very helpful, since they either use the Hadoop CLI or reference different jar files than mine, for example this one: Move data from google cloud storage to S3 using dataproc hadoop cluster and airflow and https://github.com/CoorpAcademy/docker-pyspark/issues/13
Disclaimer: I do not use the Hadoop CLI or Airflow. I do this through the console; submitting the job through the Dataproc cluster's main VM instance shell also returns the same error. If either of these is required, any detailed reference would be appreciated. Thank you very much!
Update:
- Fixed the misleading path placeholders in the gsutil commands
- The problem was that S3FileSystem is no longer supported by Hadoop, so I had to downgrade to an image with Hadoop 2.10 [FIXED] (a cluster-creation sketch with the older image is below). The speed, however, was still not satisfactory.
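For anyone hitting the same ClassNotFoundException: what worked for me was recreating the cluster on an older Dataproc image. As far as I can tell the 1.5 image track still ships Hadoop 2.10, so the cluster-creation sketch from above becomes something like this (same placeholder caveats as before):
# 1.5-debian10 is my pick of a Hadoop 2.10 image; check the Dataproc image version list for the current equivalent
gcloud dataproc clusters create <dataproc-cluster-name> \
    --region=asia-southeast1 \
    --image-version=1.5-debian10 \
    --properties='core:fs.s3.awsAccessKeyId=<key>,core:fs.s3.awsSecretAccessKey=<secret>,core:fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem'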