Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions
1
vote
1 answer

Amazon EMR terminateJobFlows for requests with more than maximum clusters that can be terminated

I'm using AWSElasticMapReduceJavaClient-1.11.x, and the maximum number of clusters that EMR can terminate at one time is 10. How would I go about handling a request to terminate, let's say, 100 clusters in one terminateJobFlows call? I'm implementing the…
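One common way to work within a per-call limit like the one described is to split the ID list into batches and issue one call per batch. The sketch below is in Python rather than the Java SDK the question uses, and it is only an illustration: `terminate_in_batches`, `chunked`, and `MAX_IDS_PER_CALL` are hypothetical names, and the limit of 10 is taken from the question, not verified here.

```python
from typing import Callable, Iterator, List, Sequence

# Limit stated in the question; treated here as an assumption.
MAX_IDS_PER_CALL = 10


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def terminate_in_batches(
    ids: Sequence[str],
    terminate: Callable[[List[str]], None],
    batch_size: int = MAX_IDS_PER_CALL,
) -> int:
    """Call `terminate` once per batch and return the number of calls made."""
    calls = 0
    for batch in chunked(ids, batch_size):
        terminate(batch)  # one API request per batch of cluster IDs
        calls += 1
    return calls
```

With boto3, the `terminate` callable could be something like `lambda batch: emr.terminate_job_flows(JobFlowIds=batch)`, where `emr` is a `boto3.client("emr")`. Passing the call in as a parameter keeps the batching logic testable without AWS credentials.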
1
vote
0 answers

Can I create a subdag with catchup enabled in a dag without catchup?

My goal is to schedule jobs with EmrCreateJobFlowOperator and EmrAddStepsOperator. Namely, I want to create a cluster and add steps for each scheduled day (or hour) starting from a specified date. Basically, I want EmrAddStepsOperator to be…
gorros
  • 1,411
  • 1
  • 18
  • 29
1
vote
1 answer

How to copy EMR streaming job logs to S3 and clean logs on EMR core node's disk

Good day, I am running a Flink (v1.7.1) streaming job on AWS EMR 5.20, and I would like to have all task_managers and job_manager's logs of my job in S3. Logback is used as recommended by the Flink team. As it is a long-running job, I want the logs…
Averell
  • 793
  • 2
  • 10
  • 21
1
vote
1 answer

Jupyter Notebook - AccessControlException: Permission denied: user=livy

I am running an EMR cluster with Spark/Livy, and would like to test Spark Structured Streaming. I am using the Jupyter Notebook managed service (which connects via Livy); however, when I try this code in Jupyter: query =…
Tex
  • 89
  • 2
  • 10
1
vote
2 answers

Spark Streaming scheduling best practices

We have a Spark Streaming job that runs every 30 minutes and takes 15 s to complete. What are the suggested best practices in this scenario? I am thinking I can schedule AWS Data Pipeline to run every 30 minutes so that EMR terminates after 15…
1
vote
1 answer

Why is sudo required when using pip to install libraries into a virtualenv on an AWS EMR master node?

I am trying to use pip to install libraries into a Python virtualenv, which resides on an AWS EMR master node. For some reason, sudo pip works fine, but non-sudo pip silently fails. Some background: I am launching an EMR cluster with version…
Chris Cugliotta
  • 119
  • 2
  • 3
1
vote
1 answer

Why is pyspark sql query against S3 returning nulls

I am getting different results when running the same query in Athena against an S3 source vs. running it from within a PySpark script on an EMR (1 x 10) cluster. I get data back from Athena, but all I get are nulls with the script. Any…
Thom Rogers
  • 1,385
  • 2
  • 20
  • 33
1
vote
2 answers

Hive Timestamp erroring as Binary

I'm trying to insert into a table with a query in an EMR cluster on AWS. The table is created correctly, and a colleague can run the exact same code that I'm using without it failing. However, when I try to run the code, I get failures in Map1 that…
Fish357
  • 87
  • 8
1
vote
1 answer

Upgrade EMR 5.19 to 5.20

As part of the EMR cluster version upgrade to 5.20.0, all the big data frameworks and AWS services in the cluster were upgraded to their latest versions. For example: Spark 2.3.2 to Spark 2.4.0, Presto 0.212 to 0.214. While testing the…
asur
  • 1,759
  • 7
  • 38
  • 81
1
vote
0 answers

No output from full outer join query in PySpark when added as an EMR step or run from Zeppelin (AWS EMR), but results are fine from the PySpark shell

When I do a full outer join in PySpark, it does not give any output. from __future__ import print_function import sys import json import os from pyspark.conf import SparkConf from functools import reduce from pyspark.sql import SparkSession…
Radhika k
  • 11
  • 3
1
vote
1 answer

Kill Spark Job or terminate EMR Cluster if job takes longer than expected

I have a spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but really the cluster is stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running…
conradlee
  • 12,985
  • 17
  • 57
  • 93
1
vote
1 answer

Error while connecting to AWS EMR cluster from mac

I'm trying to create a 3-node AWS EMR cluster. I have also created a key to connect to the cluster from macOS with the command: ssh -i ~/Downloads/BigdataKey.pem hadoop@ec2-xx-xx-xx-xx.ap-south-1.compute.amazonaws.com But it's giving an error: 192:Downloads…
1
vote
1 answer

In Terraform, can I recreate an EMR cluster resource when its bootstrap action contents change?

I'm not quite sure how to solve this problem in Terraform. We have an EMR cluster, with some bootstrap actions that are specified as S3 resources. A simplified view of our Terraform config is: resource "aws_s3_bucket_object" "bootstrap_action" { …
Dave DeCaprio
  • 2,051
  • 17
  • 31
1
vote
0 answers

EMR Cluster utilization

I have a 20-node c4.4xlarge cluster to run a Spark job. Each node has 16 vCores, 30 GiB of memory, and EBS-only storage (32 GiB). Since each node has 16 vCores, I understand that the maximum number of executors is 16*20 = 320. Total…
Abhi
  • 1,153
  • 1
  • 23
  • 38
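The executor arithmetic in the question above can be checked with a few lines of Python. This is a rough upper-bound sketch only: it ignores any cores or memory reserved for YARN, the operating system, or the Spark driver, and `cores_per_executor` is a hypothetical parameter standing in for a setting such as `spark.executor.cores`.

```python
def total_vcores(nodes: int, vcores_per_node: int) -> int:
    """Total vCores across the cluster."""
    return nodes * vcores_per_node


def max_executors(nodes: int, vcores_per_node: int, cores_per_executor: int = 1) -> int:
    """Upper bound on executors, assuming executors never span nodes."""
    if cores_per_executor < 1:
        raise ValueError("cores_per_executor must be >= 1")
    return nodes * (vcores_per_node // cores_per_executor)


# Figures from the question: 20 nodes with 16 vCores each.
print(total_vcores(20, 16))
print(max_executors(20, 16))
print(max_executors(20, 16, cores_per_executor=5))
```

With one core per executor the bound equals the total vCore count; giving each executor more cores lowers the executor count accordingly.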
1
vote
1 answer

aws emr zeppelin doesn't have jdbc interpreter

I created an AWS EMR cluster with Hadoop, Spark and Zeppelin. Following the document https://zeppelin.apache.org/docs/0.8.0/interpreter/jdbc.html , which says Fill Interpreter name field with whatever you want to use as the alias(e.g. mysql,…
Mithril
  • 12,947
  • 18
  • 102
  • 153