Questions tagged [s3distcp]

60 questions
1 vote · 1 answer

Error parsing parameter, amazon aws emr

I'm trying to create a step via the Linux console: aws emr add-steps --cluster-id j-XXXXXXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\…
David · 119 · 1 · 12
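Parse errors with `aws emr add-steps` often come from commas or quotes inside the shorthand `Type=…,Args=[…]` syntax. A common workaround is to pass the step as a JSON file instead. A minimal sketch, with a hypothetical destination bucket and arguments:

```python
import json

# Sketch: build the step definition as JSON so that commas inside Args
# never have to survive shell/shorthand parsing. Destination bucket and
# Args values below are hypothetical.
step = [
    {
        "Type": "CUSTOM_JAR",
        "Name": "S3DistCp step",
        "ActionOnFailure": "CONTINUE",
        "Jar": "/home/hadoop/lib/emr-s3distcp-1.0.jar",
        "Args": [
            "--src", "hdfs:///output",
            "--dest", "s3://my-bucket/output",  # assumed destination
            "--srcPattern", ".*\\.gz",
        ],
    }
]

with open("steps.json", "w") as f:
    json.dump(step, f, indent=2)

# Then run: aws emr add-steps --cluster-id j-XXXXXXXXXX --steps file://steps.json
```

The `file://steps.json` form hands the CLI a structure it parses as JSON rather than as comma-separated shorthand, which sidesteps the "Error parsing parameter" class of failures.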
0 votes · 1 answer

GCS Connector on EMR failing with java.lang.ClassNotFoundException

I have created an EMR cluster following the instructions on creating a connection from GCS provided here, and keep running the hadoop distcp command. It keeps failing with the following error: 2023-07-25 12:00:40,113 INFO mapreduce.Job: Task Id :…
0 votes · 0 answers

Running s3distcp from EMR to Kerberized Hadoop cluster

We have some copy jobs that use s3distcp, which run on an EMR cluster deployed on an EKS pod. These jobs copy data from S3 to HDFS and vice versa. We have been unable to run them since the Hadoop cluster was Kerberized. Tried multiple options of passing…
0 votes · 0 answers

Can we use the S3DistCp command from AWS Glue to bulk copy data to S3 using PySpark?

We are using Glue to process big data workloads. I have a requirement wherein around 500,000 records are processed in Glue using G2.X workers. We have partitioned our S3 destination bucket with certain prefixes where this data has to be…
Vijeth Kashyap · 179 · 1 · 11
0 votes · 1 answer

A single distcp command to upload several files to S3 (NO DIRECTORY)

I am currently working with the s3a adapter of Hadoop/HDFS to upload a number of files from a Hive database to a particular S3 bucket. I'm getting nervous because I can't find anything online about specifying a bunch of filepaths (not…
tprebenda · 389 · 1 · 6 · 17
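One relevant detail: `hadoop distcp` accepts multiple explicit source paths before the destination, so several individual files can be copied in one invocation with no common directory. A minimal sketch (hypothetical paths and bucket) that builds such a command line:

```python
# Sketch: one distcp invocation listing several individual source files.
# The file paths and bucket name are hypothetical.
files = [
    "hdfs:///warehouse/db/table_a/part-00000",
    "hdfs:///warehouse/db/table_b/part-00000",
]
dest = "s3a://my-bucket/exports/"  # assumed destination bucket

cmd = ["hadoop", "distcp", *files, dest]
print(" ".join(cmd))
```

For very long file lists, distcp also supports `-f <uri>` pointing at a file that contains one source path per line, which keeps the command line short.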
0 votes · 1 answer

How do I reproduce the checksums of gzip files copied with s3DistCp (from Google Cloud Storage to AWS S3)?

I copied a large number of gzip files from Google Cloud Storage to AWS S3 using s3DistCp (as this AWS article describes). When I try to compare the files' checksums, they differ (md5/sha-1/sha-256 all have the same issue). If I compare the sizes (bytes)…
Dolan Antenucci · 15,432 · 17 · 74 · 100
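A likely cause worth knowing here: the gzip container embeds a modification timestamp (and optionally a filename) in its header, so recompressing identical payload bytes produces different file bytes and therefore different checksums. A self-contained sketch of the effect, and of comparing the decompressed content instead:

```python
import gzip
import hashlib

# Identical payload compressed twice with different header mtimes:
# the compressed files differ, the decompressed content does not.
payload = b"the same csv rows\n" * 1000

a = gzip.compress(payload, mtime=0)
b = gzip.compress(payload, mtime=1700000000)  # arbitrary later timestamp

file_md5_a = hashlib.md5(a).hexdigest()
file_md5_b = hashlib.md5(b).hexdigest()

# Compare checksums of the decompressed content instead of the files.
content_md5_a = hashlib.md5(gzip.decompress(a)).hexdigest()
content_md5_b = hashlib.md5(gzip.decompress(b)).hexdigest()

print(file_md5_a != file_md5_b, content_md5_a == content_md5_b)
```

Whether s3DistCp recompresses in a given pipeline depends on its configuration, so this is one plausible explanation, not the only one; compression level differences between tools have the same effect.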
0 votes · 2 answers

How to grab all hive files after a certain date for s3 upload (python)

I'm writing a program for a daily upload to S3 of all our Hive tables from a particular DB. This database contains records from many years ago, however, and is way too large for a full copy/distcp. I want to search the entire directory in HDFS that…
tprebenda · 389 · 1 · 6 · 17
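The core of this task is a recursive walk that keeps only files whose modification time is on or after a cutoff. A minimal sketch, using the plain local filesystem as a stand-in for HDFS; against real HDFS the timestamps would come from the NameNode (e.g. `hdfs dfs -stat` or a Python HDFS client) rather than `os.stat`:

```python
import os

def files_modified_since(root, cutoff_epoch):
    """Return paths under root whose mtime is >= cutoff_epoch (Unix time)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime >= cutoff_epoch:
                matches.append(path)
    return matches
```

The resulting list can then feed an upload step, for example as the explicit source paths of a single distcp invocation or a boto3 upload loop.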
0 votes · 0 answers

How does speculative execution impact s3-dist-cp job?

I have noticed that sometimes s3-dist-cp takes much longer than usual due to a "slow node" issue. In the case of Spark I have enabled speculative execution, which works fine. However, when it comes to s3-dist-cp I would like to understand possible…
Grzes · 971 · 1 · 13 · 28
0 votes · 0 answers

Does s3-dist-cp on EMR use EMR consistent view metadata?

I'm using the EMR consistent view feature when running some of my Hive queries. Now I need to access and copy objects directly from S3 using s3-dist-cp, bypassing the Hive interface, which uses the EMRFS consistent view metadata stored in DynamoDB. When I…
0 votes · 2 answers

How to read and repartition a large dataset from one s3 location to another using spark, s3Distcp & aws EMR

I am trying to move data in S3 which is partitioned on a date string at rest (source) to another location where it is partitioned at rest (destination) as year=yyyy/month=mm/day=dd/. While I am able to read the entire source location data in Spark and…
rajgumma · 21 · 2
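The path-mapping part of this is mechanical: each flat date-string partition maps to a Hive-style `year=/month=/day=` prefix. A sketch, assuming the source partitions use a `yyyymmdd` date string (the exact source layout is not given in the excerpt); in practice this mapping would be applied per partition when building the `--dest` of an s3-dist-cp call or the output path of a Spark write:

```python
from datetime import datetime

def repartitioned_prefix(date_str):
    """Map an assumed flat yyyymmdd partition key to year=/month=/day=/."""
    d = datetime.strptime(date_str, "%Y%m%d")
    return f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"

print(repartitioned_prefix("20200315"))  # year=2020/month=03/day=15/
```

Driving one copy per partition this way avoids reading the whole dataset through Spark just to rewrite the directory layout.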
0 votes · 0 answers

How to copy files from S3 to S3 in the same folder?

I am trying to combine log files from S3 to S3 using the following command: s3-dist-cp --src s3://path/to/ym=2020/ --dest s3://path/to/ym=2020/ --groupBy='.*/(\d{8}).+(\.json)' --deleteOnSuccess. I have the following files…
Pentel · 11 · 1 · 3
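For readers unfamiliar with `--groupBy`: files whose keys match the regex are concatenated, and the concatenation of the capture groups names the combined output object. The grouping behaviour of the pattern above can be sketched in plain Python (the object keys below are hypothetical):

```python
import re

# The --groupBy pattern from the question: capture an 8-digit date and the
# .json extension; files sharing the joined capture text get merged.
pattern = re.compile(r".*/(\d{8}).+(\.json)")

keys = [
    "path/to/ym=2020/20200101-part-0001.json",  # hypothetical keys
    "path/to/ym=2020/20200101-part-0002.json",
]

groups = {}
for key in keys:
    m = pattern.match(key)
    if m:
        groups.setdefault("".join(m.groups()), []).append(key)

print(groups)  # both objects fall into one "20200101.json" group
```

Note that with `--src` and `--dest` pointing at the same prefix, `--deleteOnSuccess` removes the matched source files after the merged output is written, so testing against a scratch prefix first is prudent.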
0 votes · 1 answer

Performance issue with AWS EMR S3DistCp

I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20…
0 votes · 0 answers

How to copy large number of smaller files from EMR (Hdfs) to S3 bucket?

I have a large CSV file with the following details: total records: 20 million; total columns: 45; total file size: 8 GB. I am trying to process this CSV file using Apache Spark (a distributed computing engine) on AWS EMR. I am partitioning this CSV file based on…
TheCodeCache · 820 · 1 · 7 · 27
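The usual approach for the many-small-files case is to let s3-dist-cp merge them into fewer, larger objects on the way to S3, using `--groupBy` with `--targetSize` rather than uploading each part file individually. A sketch of such an invocation, built as an argv list (paths, bucket, and the grouping pattern are hypothetical):

```python
# Sketch: merge small HDFS part files into ~128 MiB S3 objects.
# Source/destination paths and the groupBy pattern are hypothetical.
args = [
    "s3-dist-cp",
    "--src", "hdfs:///output/partitioned/",
    "--dest", "s3://my-bucket/output/",      # assumed bucket
    "--groupBy", ".*/(part-).*",             # merge files per capture text
    "--targetSize", "128",                   # target merged size in MiB
]
print(" ".join(args))
```

Fewer, larger objects cut both the per-object S3 request overhead during the copy and the file-listing overhead for downstream readers.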
0 votes · 2 answers

How to uncompress file while loading from HDFS to S3?

I have CSV files in LZO format in HDFS. I would like to load these files into S3 and then into Snowflake; since Snowflake does not provide LZO compression for the CSV file format, I need to convert them on the fly while loading these files to S3.
Vishrant · 15,456 · 11 · 71 · 120
0 votes · 1 answer

JSON aggregation using s3-dist-cp for Spark application consumption

My Spark application running on AWS EMR loads data from a JSON array stored in S3. The DataFrame created from it is then processed by the Spark engine. My source JSON data is in the form of multiple S3 objects. I need to compact them into a JSON array to…
user1988501 · 27 · 1 · 3