Questions tagged [s3distcp]

60 questions
1 vote · 1 answer

Error parsing parameter, amazon aws emr

I'm trying to create a step via the Linux console: aws emr add-steps --cluster-id j-XXXXXXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\…
David · 119 · 1 · 12
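Parse errors with `aws emr add-steps` often come from commas or quotes inside the shorthand `Type=…,Args=[…]` syntax. A common workaround is to pass the step as a JSON file instead. A minimal sketch, with a hypothetical destination bucket and arguments:

```python
import json

# Sketch: build the step definition as JSON so that commas inside Args
# never have to survive shell/shorthand parsing. Destination bucket and
# Args values below are hypothetical.
step = [
    {
        "Type": "CUSTOM_JAR",
        "Name": "S3DistCp step",
        "ActionOnFailure": "CONTINUE",
        "Jar": "/home/hadoop/lib/emr-s3distcp-1.0.jar",
        "Args": [
            "--src", "hdfs:///output",
            "--dest", "s3://my-bucket/output",  # assumed destination
            "--srcPattern", ".*\\.gz",
        ],
    }
]

with open("steps.json", "w") as f:
    json.dump(step, f, indent=2)

# Then run: aws emr add-steps --cluster-id j-XXXXXXXXXX --steps file://steps.json
```

The `file://steps.json` form hands the CLI a structure it parses as JSON rather than as comma-separated shorthand, which sidesteps the "Error parsing parameter" class of failures.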
0 votes · 1 answer

GCS Connector on EMR failing with java.lang.ClassNotFoundException

I have created an EMR cluster following the instructions on creating a connection from GCS provided here, and keep running the hadoop distcp command. It keeps failing with the following error: 2023-07-25 12:00:40,113 INFO mapreduce.Job: Task Id :…
0 votes · 0 answers

Running s3distcp from EMR to Kerberized Hadoop cluster

We have some copy jobs that use s3distcp, which run on an EMR cluster deployed on an EKS pod. These jobs copy data from S3 to HDFS and vice versa. We have been unable to run them since the Hadoop cluster was Kerberized. Tried multiple options of passing…
0 votes · 0 answers

Can we use the S3DistCp command from AWS Glue to bulk copy data to S3 using PySpark?

We are using Glue to process big data workloads. I have a requirement wherein around 500,000 records are processed in Glue using G2.X workers. We have partitioned our S3 destination bucket with certain prefixes where this data has to be…
Vijeth Kashyap · 179 · 1 · 11
0 votes · 1 answer

A single distcp command to upload several files to S3 (NO DIRECTORY)

I am currently working with the s3a adapter of Hadoop/HDFS to upload a number of files from a Hive database to a particular S3 bucket. I'm getting nervous because I can't find anything online about specifying a bunch of filepaths (not…
tprebenda · 389 · 1 · 6 · 17
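One relevant detail: `hadoop distcp` accepts multiple explicit source paths before the destination, so several individual files can be copied in one invocation with no common directory. A minimal sketch (hypothetical paths and bucket) that builds such a command line:

```python
# Sketch: one distcp invocation listing several individual source files.
# The file paths and bucket name are hypothetical.
files = [
    "hdfs:///warehouse/db/table_a/part-00000",
    "hdfs:///warehouse/db/table_b/part-00000",
]
dest = "s3a://my-bucket/exports/"  # assumed destination bucket

cmd = ["hadoop", "distcp", *files, dest]
print(" ".join(cmd))
```

For very long file lists, distcp also supports `-f <uri>` pointing at a file that contains one source path per line, which keeps the command line short.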
0 votes · 1 answer

How do I reproduce the checksums of gzip files copied with s3DistCp (from Google Cloud Storage to AWS S3)?

I copied a large number of gzip files from Google Cloud Storage to AWS S3 using s3DistCp (as this AWS article describes). When I try to compare the files' checksums, they differ (md5/sha-1/sha-256 all have the same issue). If I compare the sizes (bytes)…
Dolan Antenucci · 15,432 · 17 · 74 · 100
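A likely cause worth knowing here: the gzip container embeds a modification timestamp (and optionally a filename) in its header, so recompressing identical payload bytes produces different file bytes and therefore different checksums. A self-contained sketch of the effect, and of comparing the decompressed content instead:

```python
import gzip
import hashlib

# Identical payload compressed twice with different header mtimes:
# the compressed files differ, the decompressed content does not.
payload = b"the same csv rows\n" * 1000

a = gzip.compress(payload, mtime=0)
b = gzip.compress(payload, mtime=1700000000)  # arbitrary later timestamp

file_md5_a = hashlib.md5(a).hexdigest()
file_md5_b = hashlib.md5(b).hexdigest()

# Compare checksums of the decompressed content instead of the files.
content_md5_a = hashlib.md5(gzip.decompress(a)).hexdigest()
content_md5_b = hashlib.md5(gzip.decompress(b)).hexdigest()

print(file_md5_a != file_md5_b, content_md5_a == content_md5_b)
```

Whether s3DistCp recompresses in a given pipeline depends on its configuration, so this is one plausible explanation, not the only one; compression level differences between tools have the same effect.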
0 votes · 2 answers

How to grab all hive files after a certain date for s3 upload (python)

I'm writing a program for a daily upload to S3 of all our Hive tables from a particular DB. This database contains records from many years ago, however, and is way too large for a full copy/distcp. I want to search the entire directory in HDFS that…
tprebenda · 389 · 1 · 6 · 17
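The core of this task is a recursive walk that keeps only files whose modification time is on or after a cutoff. A minimal sketch, using the plain local filesystem as a stand-in for HDFS; against real HDFS the timestamps would come from the NameNode (e.g. `hdfs dfs -stat` or a Python HDFS client) rather than `os.stat`:

```python
import os

def files_modified_since(root, cutoff_epoch):
    """Return paths under root whose mtime is >= cutoff_epoch (Unix time)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime >= cutoff_epoch:
                matches.append(path)
    return matches
```

The resulting list can then feed an upload step, for example as the explicit source paths of a single distcp invocation or a boto3 upload loop.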
0 votes · 0 answers

How does speculative execution impact s3-dist-cp job?

I have noticed that sometimes s3-dist-cp takes much longer than usual due to a "slow node" issue. In the case of Spark I have enabled speculative execution, which works fine. However, when it comes to s3-dist-cp I would like to understand possible…
Grzes · 971 · 1 · 13 · 28
0 votes · 0 answers

Does s3-dist-cp on EMR use EMR consistent view metadata?

I'm using the EMR consistent view feature when running some of my Hive queries. Now I need to access and copy objects directly from S3 using s3-dist-cp, bypassing the Hive interface, which uses the EMRFS consistent view metadata stored in DynamoDB. When I…
0 votes · 2 answers

How to read and repartition a large dataset from one s3 location to another using spark, s3Distcp & aws EMR

I am trying to move data in S3 which is partitioned on a date string at rest (source) to another location where it is partitioned at rest (destination) as year=yyyy/month=mm/day=dd/. While I am able to read the entire source location data in Spark and…
rajgumma · 21 · 2
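The path-mapping part of this is mechanical: each flat date-string partition maps to a Hive-style `year=/month=/day=` prefix. A sketch, assuming the source partitions use a `yyyymmdd` date string (the exact source layout is not given in the excerpt); in practice this mapping would be applied per partition when building the `--dest` of an s3-dist-cp call or the output path of a Spark write:

```python
from datetime import datetime

def repartitioned_prefix(date_str):
    """Map an assumed flat yyyymmdd partition key to year=/month=/day=/."""
    d = datetime.strptime(date_str, "%Y%m%d")
    return f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"

print(repartitioned_prefix("20200315"))  # year=2020/month=03/day=15/
```

Driving one copy per partition this way avoids reading the whole dataset through Spark just to rewrite the directory layout.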
0 votes · 0 answers

How to copy files from S3 to S3 in the same folder?

I am trying to combine log files from S3 to S3 using the following command: s3-dist-cp --src s3://path/to/ym=2020/ --dest s3://path/to/ym=2020/ --groupBy='.*/(\d{8}).+(\.json)' --deleteOnSuccess. I have the following files…
Pentel · 11 · 1 · 3
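For readers unfamiliar with `--groupBy`: files whose keys match the regex are concatenated, and the concatenation of the capture groups names the combined output object. The grouping behaviour of the pattern above can be sketched in plain Python (the object keys below are hypothetical):

```python
import re

# The --groupBy pattern from the question: capture an 8-digit date and the
# .json extension; files sharing the joined capture text get merged.
pattern = re.compile(r".*/(\d{8}).+(\.json)")

keys = [
    "path/to/ym=2020/20200101-part-0001.json",  # hypothetical keys
    "path/to/ym=2020/20200101-part-0002.json",
]

groups = {}
for key in keys:
    m = pattern.match(key)
    if m:
        groups.setdefault("".join(m.groups()), []).append(key)

print(groups)  # both objects fall into one "20200101.json" group
```

Note that with `--src` and `--dest` pointing at the same prefix, `--deleteOnSuccess` removes the matched source files after the merged output is written, so testing against a scratch prefix first is prudent.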
0 votes · 1 answer

Performance issue with AWS EMR S3DistCp

I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20…
0 votes · 0 answers

How to copy large number of smaller files from EMR (Hdfs) to S3 bucket?

I have a large CSV file with the following details: total records: 20 million; total columns: 45; total file size: 8 GB. I am trying to process this CSV file using Apache Spark (a distributed computing engine) on AWS EMR. I am partitioning this CSV file based on…
TheCodeCache · 820 · 1 · 7 · 27
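The usual approach for the many-small-files case is to let s3-dist-cp merge them into fewer, larger objects on the way to S3, using `--groupBy` with `--targetSize` rather than uploading each part file individually. A sketch of such an invocation, built as an argv list (paths, bucket, and the grouping pattern are hypothetical):

```python
# Sketch: merge small HDFS part files into ~128 MiB S3 objects.
# Source/destination paths and the groupBy pattern are hypothetical.
args = [
    "s3-dist-cp",
    "--src", "hdfs:///output/partitioned/",
    "--dest", "s3://my-bucket/output/",      # assumed bucket
    "--groupBy", ".*/(part-).*",             # merge files per capture text
    "--targetSize", "128",                   # target merged size in MiB
]
print(" ".join(args))
```

Fewer, larger objects cut both the per-object S3 request overhead during the copy and the file-listing overhead for downstream readers.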
0 votes · 2 answers

How to uncompress file while loading from HDFS to S3?

I have CSV files in LZO format in HDFS. I would like to load these files into S3 and then into Snowflake; since Snowflake does not provide LZO compression for the CSV file format, I need to convert them on the fly while loading these files to S3.
Vishrant · 15,456 · 11 · 71 · 120
0 votes · 1 answer

JSON aggregation using s3-dist-cp for Spark application consumption

My Spark application running on AWS EMR loads data from a JSON array stored in S3. The DataFrame created from it is then processed by the Spark engine. My source JSON data is in the form of multiple S3 objects. I need to compact them into a JSON array to…
user1988501 · 27 · 1 · 3