Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1166 questions
10
votes
1 answer

ClusterID vs JobFlowID on AWS EMR

I am a bit confused about the available APIs and the two identifiers. I am using boto, but I don't think that is the problem here: my question applies to any API (but not the CLI). I start a JobFlow with RunJobFlow, which returns a JobFlowId. Let's assume…
user2123288
  • 1,103
  • 1
  • 13
  • 22
10
votes
1 answer

How to edit and relaunch a terminated cluster on Amazon EMR?

I am new to AWS and Amazon EMR. I created a new cluster with a custom bootstrap script. When I launched the cluster, it terminated because the bootstrap script failed. I have now fixed my script and want to relaunch the cluster. Now, in the EMR console…
kajarigd
  • 1,299
  • 3
  • 28
  • 46
9
votes
2 answers

Spark job just hangs with large data

I am trying to query 15 days of data from S3. Querying each day separately works fine, and querying 14 days together works fine as well. But when I query 15 days, the job keeps running forever (hangs) and the task # stops updating. My settings…
user3407267
  • 1,524
  • 9
  • 30
  • 57
9
votes
1 answer

Run Command on EMR Slaves?

I'm trying to update a running EMR cluster by running pip install on all the slave machines. How can I do that? I can't do it with a bootstrap step because it is a long-running EMR cluster and I can't take it down. The EMR cluster is running Spark & YARN, so I…
9
votes
1 answer

How to recover EMR from "Terminated with errors Instance failure" Status

I am new to AWS EMR. Several days ago I stopped (not terminated) the EMR EC2 instances, and the EMR cluster status then became "Terminated with errors Instance failure". How can I recover it? I can no longer find the related EC2 instances.
Andy Nie
  • 103
  • 1
  • 5
9
votes
1 answer

Add streaming step to MR job in boto3 running on AWS EMR 5.0

I'm trying to migrate a couple of MR jobs that I have written in Python from AWS EMR 2.4 to AWS EMR 5.0. Until now I was using boto 2.4, but it doesn't support EMR 5.0, so I'm trying to switch to boto3. Earlier, while using boto 2.4, I used the…
m_amber
  • 747
  • 3
  • 13
  • 23
9
votes
1 answer

How to properly provide credentials for spark-redshift in EMR instances?

We were trying to use the spark-redshift project, following the 3rd recommendation for providing the credentials, namely: IAM instance profiles: If you are running on EC2 and authenticate to S3 using IAM and instance profiles, then you must…
ale64bit
  • 6,232
  • 3
  • 24
  • 44
9
votes
1 answer

spark-submit EMR Step failing when submitted using boto3 client

I'm trying to execute spark-submit using the boto3 client for EMR. After executing the code below, the EMR step is submitted and fails after a few seconds. The actual command line from the step logs works if executed manually on the EMR master. Controller log…
Robert Navado
  • 1,319
  • 11
  • 14
9
votes
1 answer

AWS EMR Step failed as jobs it created failed

I'm trying to analyse a Wikipedia article view dataset using Amazon EMR. This data set contains page view statistics over a three month period (1 Jan 2011 - 31 March 2011). I am trying to find the article with the most views over that time. Here is…
spoon
  • 101
  • 1
  • 6
9
votes
1 answer

Lambda to create EMR cluster doesn't fire the cluster creation

I'm trying to run a λ code that creates a cluster, but nothing happens, maybe I'm misunderstanding the usage on Node (since I'm not that familiar with it.) The function is as simple as: // configure AWS Dependecies var AWS =…
Diego Magalhães
  • 725
  • 1
  • 10
  • 32
9
votes
1 answer

EMR activity stuck in Waiting_For_Runner state

I am creating a data pipeline to export a DynamoDB table to an S3 bucket. I used the standard template for this in the Data Pipeline console. I have verified that the runsOn field is set to the name of the EMR cluster to be started. However, the EMR activity…
user3610975
  • 113
  • 2
  • 4
8
votes
1 answer

Airflow - Task Instance in EMR operator

In Airflow, I need to pass the job_flow_id to one of my EMR steps. I am able to retrieve the job_flow_id from the operator, but when I create the steps to submit to the cluster, the task_instance value is…
spaghettifunk
  • 1,936
  • 4
  • 24
  • 46
8
votes
1 answer

How to set spark.driver.memory for Spark/Zeppelin on EMR

When using EMR (with Spark, Zeppelin), changing spark.driver.memory in Zeppelin Spark interpreter settings won't work. I wonder what is the best and quickest way to set Spark driver memory when using EMR web interface (not aws CLI) to create…
Rami
  • 8,044
  • 18
  • 66
  • 108
8
votes
2 answers

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

I am creating clusters on EMR and configuring Zeppelin to read the notebooks from S3. To do that I am using a JSON object that looks like this: [ { "Classification": "zeppelin-env", "Properties": { }, "Configurations": [ { …
Rami
  • 8,044
  • 18
  • 66
  • 108
8
votes
1 answer

Installing Python packages via Bootstrap Actions for PySpark on EMR

I've got a problem that's driving me crazy, partly because it's so simple. I have an ETL job I'd like to perform using PySpark on EMR. The problem is that there are packages I need to install, such as numpy, py-stringmatching, etc., and I can't seem to…