Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions
1
vote
0 answers

MapReduce job with AWS Elastic MapReduce (EMR) - why was 648 MB of input split into 27 map tasks?

I used AWS EMR (Hadoop streaming) to process 648 MB of input data in 9 text files (approx. 72 MB each, stored in S3). I thought it would split the data into either 64 MB or 128 MB blocks, but the log says it was split into 27 map tasks (I think one map…
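A plausible explanation, sketched below: splits are computed per file from the filesystem block size, and the old S3-native (s3n) default block size on EMR was 32 MB (an assumption here; check `fs.s3n.block.size` on the cluster), which yields exactly 27 map tasks for this input:

```python
import math

# Numbers from the question: 9 text files of ~72 MB each.
file_size_mb = 72
num_files = 9

# Assumed fs.s3n.block.size of 32 MB (the historical S3-native
# filesystem default). Splits are computed per file, so a split
# never spans file boundaries.
block_size_mb = 32

splits_per_file = math.ceil(file_size_mb / block_size_mb)  # ceil(72/32) = 3
total_map_tasks = num_files * splits_per_file              # 9 * 3 = 27
print(splits_per_file, total_map_tasks)  # → 3 27
```

If the observed task count matches this arithmetic, raising the block-size setting (or using larger input files) would reduce the number of map tasks.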
1
vote
1 answer

AWS EMR: Spark - SparkException java IOException: Failed to create local dir in /tmp/blockmgr*

I have an AWS EMR cluster with Spark. I can connect to it (Spark): from the master node, after SSHing into it; from another AWS EMR cluster. But I am NOT able to connect to it: from my local machine (macOS Mojave); from non-EMR machines like Metabase and…
user954311
  • 41
  • 3
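One frequent cause of `Failed to create local dir in /tmp/blockmgr*` is that the machine the driver runs on (here, the non-EMR hosts) has no writable scratch directory where Spark expects one. A hedged config sketch, assuming a writable path such as `/mnt/spark-local` exists on every machine involved:

```
# spark-defaults.conf — point Spark's scratch space (block manager,
# shuffle spill) at a volume that exists and is writable on every
# node; the path here is an example, not a required location
spark.local.dir /mnt/spark-local
```

Checking free space and permissions on whatever `spark.local.dir` resolves to is usually the first diagnostic step.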
1
vote
0 answers

Trigger python script on EMR from Lambda on S3 Object arrival with Object details

I am trying to trigger a lambda function object arrival in s3 alonng with object details like name and path. then trigger python script on EMR which will access the file which is on s3. Please let me know how i can trigger python script (may within…
RajaR
  • 11
  • 4
1
vote
2 answers

How to handle big reference data in Spark

I have a big data set (let's say 4 GB) that is used as a reference source to process another big data set (100-200 GB). I have a cluster of 30 executors to do that on 10 nodes. So every executor has its own JVM, right? Every time it loads the whole reference…
jk1
  • 593
  • 6
  • 16
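On the "load it once per executor" point: Spark's own mechanism for this is a broadcast variable (`sc.broadcast(ref)`), which ships the reference data to each executor JVM once instead of once per task. The same idea in plain Python, as a minimal process-local cache sketch (`load_reference` is a hypothetical stand-in for reading the real 4 GB data set):

```python
import functools

load_count = 0  # counts how many times the reference data is actually read

@functools.lru_cache(maxsize=1)
def load_reference():
    """Stand-in for loading the 4 GB reference set; cached per process."""
    global load_count
    load_count += 1
    return {"key-%d" % i: i for i in range(1000)}  # placeholder data

def process_record(record):
    ref = load_reference()  # cached after the first call in this process
    return ref.get(record, -1)

results = [process_record("key-%d" % i) for i in range(5)]
print(results, load_count)  # → [0, 1, 2, 3, 4] 1
```

Inside Spark, the equivalent is reading `broadcast_var.value` in the task, or loading once per partition inside `mapPartitions`, so the cost is paid per executor (or per partition), not per record.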
1
vote
2 answers

Usage of StreamingFileSink is throwing NoClassDefFoundError

I know this could be my problem, but I've been trying to grapple with it for a while. I am trying to run Flink in an AWS EMR cluster. My setup is: time-series events from Kinesis -> Flink job -> save to S3. DataStream kinesis = …
Nischit
  • 161
  • 1
  • 8
1
vote
0 answers

How to kill all Spark processes from within mapPartitions running on EMR slave nodes?

We are running pyspark in an EMR cluster and have ~50 million records in a dataframe. Each needs a field added to it from an API, which accepts 100 records at a time (so ~500k total requests). We are able to split them up and make the API calls…
kylerm42
  • 53
  • 4
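For the "100 records per request" part, a batching helper like the sketch below (plain Python, no Spark required) is the usual shape inside `mapPartitions`. For the "kill everything" part, `sc.cancelAllJobs()` on the driver cancels running Spark jobs, but it cannot be called from inside executors, so a failure typically has to propagate back to the driver first.

```python
from itertools import islice

def batches(iterable, size=100):
    """Yield successive lists of at most `size` items — e.g. so each
    partition issues one API call per 100 rows instead of per row."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# 250 fake records -> batches of 100, 100, 50
sizes = [len(b) for b in batches(range(250), 100)]
print(sizes)  # → [100, 100, 50]
```

With ~50 million records this gives the ~500k requests the question estimates; wrapping the API call in a retry with a hard failure after N attempts is the usual way to surface a permanent error to the driver.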
1
vote
1 answer

Accessing Hive Tables with Spark SQL

I've set up an AWS EMR cluster that includes Spark 2.3.2, Hive 2.3.3, and HBase 1.4.7. How can I configure Spark to access Hive tables? I've taken the following steps, but the result is the error message: java.lang.ClassNotFoundException:…
Ari
  • 4,121
  • 8
  • 40
  • 56
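A hedged configuration sketch: Spark SQL only consults the Hive metastore when its catalog implementation is `hive` (in code, the equivalent is building the session with `SparkSession.builder.enableHiveSupport()`). Assuming EMR's default metastore setup:

```
# spark-defaults.conf — use the Hive metastore as Spark SQL's catalog
spark.sql.catalogImplementation hive
```

If the session is built without Hive support, Spark falls back to its in-memory catalog and Hive tables are simply invisible; a `ClassNotFoundException` at that point usually means the Hive (or HBase storage-handler) jars are missing from the Spark classpath rather than a metastore problem.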
1
vote
1 answer

Exceptions while running Spark job on EMR cluster "java.io.IOException: All datanodes are bad"

We have an AWS EMR setup to process jobs written in Scala. We are able to run the jobs on a small dataset, but running the same job on a large dataset throws the exception "java.io.IOException: All datanodes are bad."
Devendra Parhate
  • 135
  • 1
  • 2
  • 12
1
vote
0 answers

S3 Bucket Policy to allow access to specific AWS services and users and restrict all others

I have a bucket policy that restricts access for other users, but I want it to remain accessible to AWS services like EMR. I found the same question asked here: S3 Bucket Policy to Allow access to specific users and restrict all. But I…
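One common shape for this, sketched with placeholder account and principal names (the ARNs below are examples, not a tested policy): deny everything to any principal whose ARN is not on an allow list, which lets the IAM role the EMR instances run under (e.g. the default `EMR_EC2_DefaultRole` instance profile) keep access:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyAllExceptListedPrincipals",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {
      "StringNotLike": {
        "aws:PrincipalArn": [
          "arn:aws:iam::111122223333:role/EMR_EC2_DefaultRole",
          "arn:aws:iam::111122223333:user/allowed-user"
        ]
      }
    }
  }]
}
```

The `aws:PrincipalArn` condition key matches the role ARN even for assumed-role sessions, which is why listing the EMR instance role is enough; an explicit Deny like this overrides any Allow, so the allow list must be complete before applying it.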
1
vote
2 answers

Loading parquet file from S3 to DynamoDB

I have been looking at options to load (basically empty and restore) a Parquet file from S3 to DynamoDB. The Parquet file itself is created via a Spark job that runs on an EMR cluster. Here are a few things to keep in mind: I cannot use AWS Data pipeline File…
1
vote
0 answers

Samza 1.1.0 - run-app.sh does not work during deployment of hello samza

I am facing errors when I deploy the hello samza tutorial on YARN following the documentation. In particular, I was getting errors when I ran the run-app.sh script as mentioned. I am currently using Samza 1.1.0 on AWS EMR (emr - 5.13.0, amazon 2.8.3,…
Harsha
  • 11
  • 3
1
vote
1 answer

Get aws EMR DNS address using CLI

I am trying to set up some easy code to run when spinning up an EMR cluster for some ad hoc work I have to do from time to time. Right now I pass the 'aws emr create-cluster' command and then find the DNS in the console once the cluster is created, to…
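A CLI-only sketch (instance settings are placeholders, not a recommendation): capture the cluster id from `create-cluster`, wait for the cluster to come up, then query the master's public DNS name, so nothing has to be looked up in the console:

```shell
# Create the cluster and capture its id (fill in your own instance
# types, key pair, subnet, etc.)
CLUSTER_ID=$(aws emr create-cluster \
    --name "adhoc" \
    --release-label emr-5.13.0 \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles \
    --query ClusterId --output text)

# Block until the cluster reaches a running state
aws emr wait cluster-running --cluster-id "$CLUSTER_ID"

# Print the master node's public DNS name
aws emr describe-cluster --cluster-id "$CLUSTER_ID" \
    --query Cluster.MasterPublicDnsName --output text
```

The `--query`/`--output text` flags are standard AWS CLI JMESPath filtering, so the DNS name comes back as a bare string suitable for scripting (e.g. feeding straight into an `ssh` command).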
1
vote
1 answer

How to efficiently read/parse loads of .gz files in a s3 folder with spark on EMR

I'm trying to read all files in a directory on S3 via a Spark app that's executing on EMR. The data is stored in a typical format like "s3a://Some/path/yyyy/mm/dd/hh/blah.gz". If I use deeply nested wildcards (e.g.…
User
  • 168
  • 1
  • 3
  • 19
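One way around slow deep wildcards, sketched below against the question's layout: expand the date range into explicit hourly prefixes in plain Python and hand Spark the concrete list (e.g. joined with commas, since `textFile` accepts comma-separated paths), so S3 listing only touches the prefixes that actually matter.

```python
from datetime import datetime, timedelta

def hourly_paths(prefix, start, end):
    """Expand an hourly-partitioned layout into explicit path prefixes,
    avoiding a deep wildcard like s3a://Some/path/*/*/*/*/*.gz."""
    t = start
    while t <= end:
        yield t.strftime(prefix + "/%Y/%m/%d/%H/")
        t += timedelta(hours=1)

paths = list(hourly_paths("s3a://Some/path",
                          datetime(2019, 1, 1, 22),
                          datetime(2019, 1, 2, 1)))
print(paths[0])   # → s3a://Some/path/2019/01/01/22/
print(len(paths)) # → 4 (22:00 through 01:00 inclusive)
```

Passing `",".join(paths)` to the read call then restricts listing to those four prefixes instead of recursively globbing the whole tree.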
1
vote
1 answer

Problem importing modules from a .zip file (created in python using zipfile package) with --py-files on an EMR in Spark

I am trying to archive my application modules to spark-submit on an EMR cluster like this: Folder structure of modules: app --- module1 ------ test.py ------ test2.py --- module2 ------ file1.py ------ file2.py Zip function I'm calling from…
Collin Rea
  • 23
  • 5
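A self-contained sketch of the usual pitfall (file names mirror the question's layout): entries in the zip must be stored with package-relative arcnames, and each package needs an `__init__.py`, or imports from the archive fail on the executors:

```python
import os
import sys
import tempfile
import zipfile

# Recreate a tiny version of the question's app/ layout on disk
root = tempfile.mkdtemp()
pkg = os.path.join(root, "module1")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()  # mark as a package
with open(os.path.join(pkg, "test.py"), "w") as f:
    f.write("VALUE = 42\n")

# Build the archive with arcnames relative to the app root
# (module1/test.py), NOT absolute paths — this is the common mistake.
archive = os.path.join(root, "app.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for dirpath, _, files in os.walk(pkg):
        for name in files:
            full = os.path.join(dirpath, name)
            zf.write(full, arcname=os.path.relpath(full, root))

# Python can import straight from the zip, which is exactly what Spark
# executors do with a --py-files archive added to their path
sys.path.insert(0, archive)
from module1 import test
print(test.VALUE)  # → 42
```

If `zipfile.ZipFile(archive).namelist()` shows absolute or machine-specific paths instead of `module1/test.py`, the `--py-files` imports will break the same way they do locally with this test.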
1
vote
1 answer

How to change the Apache Zeppelin UI appearance and make edits to elements

I'm currently running Apache Zeppelin 0.7.2 on an AWS EMR machine. Is there any way to replace the Zeppelin logo and words at the top with other text and images? I tried to use the Inspect Element feature in Chrome on the Zeppelin webpage and…