Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions

1 answer

Amazon EMR terminateJobFlows for requests with more clusters than the maximum that can be terminated

I'm using AWSElasticMapReduceJavaClient-1.11.x, and the maximum clusters that EMR can terminate at one time is 10. How would I go about terminating a request with let's say 100 clusters all in one terminateJobFlows call? I'm implementing the…

java amazon-web-services amazon-emr

asked Feb 05 '19 at 17:30

Shailesh Patel
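
One common way to approach the question above is to split the cluster IDs into batches and issue one call per batch. The sketch below is a minimal illustration in Python rather than the asker's Java SDK, and it takes the limit of 10 clusters per call from the question itself, not from a verified quota. The names `chunked`, `terminate_in_batches`, and the `terminate` callback are hypothetical; in practice the callback might wrap a real client call such as boto3's `terminate_job_flows(JobFlowIds=batch)`.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def terminate_in_batches(
    ids: Sequence[str],
    terminate: Callable[[List[str]], None],
    batch_size: int = 10,  # assumption: the per-call limit quoted in the question
) -> int:
    """Call `terminate` once per batch and return the number of calls made."""
    calls = 0
    for batch in chunked(ids, batch_size):
        terminate(batch)  # hypothetical callback; wrap the real EMR client here
        calls += 1
    return calls


if __name__ == "__main__":
    # Stand-in callback that only records what would have been sent.
    sent: List[List[str]] = []
    cluster_ids = [f"j-EXAMPLE{i:03d}" for i in range(25)]  # made-up IDs
    print(terminate_in_batches(cluster_ids, sent.append))
    print([len(batch) for batch in sent])
```

Passing the terminate call in as a callback keeps the batching logic testable without AWS credentials; error handling and throttling between calls are left out of this sketch.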

0 answers

Can I create a subdag with catchup enabled in a dag without catchup?

My goal is to schedule jobs with EmrCreateJobFlowOperator and EmrAddStepsOperator. Namely, I want to create a cluster and add steps for each scheduled day (or hour) starting from a specified date. Basically, I want EmrAddStepsOperator to be…

airflow amazon-emr

asked Feb 04 '19 at 10:27

gorros

1 answer

How to copy EMR streaming job logs to S3 and clean logs on EMR core node's disk

Good day, I am running a Flink (v1.7.1) streaming job on AWS EMR 5.20, and I would like to have all task_managers and job_manager's logs of my job in S3. Logback is used as recommended by the Flink team. As it is a long-running job, I want the logs…

logging streaming hadoop-yarn amazon-emr flink-streaming

asked Feb 04 '19 at 04:53

Averell

1 answer

Jupyter Notebook - AccessControlException: Permission denied: user=livy

I am running an EMR cluster with Spark/Livy, and would like to test Spark Structured Streaming. I am using the Jupyter Notebook managed service (which connects via Livy); however, when I try this code in Jupyter: query =…

apache-spark jupyter-notebook amazon-emr

asked Feb 03 '19 at 14:30

Tex

2 answers

Spark Streaming scheduling best practices

We have a Spark Streaming job that runs every 30 mins and takes 15s to complete. What are the suggested best practices in this scenario? I am thinking I can schedule AWS Data Pipeline to run every 30 mins so that EMR terminates after 15…

pyspark spark-streaming amazon-emr amazon-kinesis aws-data-pipeline

asked Feb 03 '19 at 02:58

RockerZ

1 answer

Why is sudo required when using pip to install libraries into a virtualenv on an AWS EMR master node?

I am trying to use pip to install libraries into a Python virtualenv, which resides on an AWS EMR master node. For some reason, sudo pip works fine, but non-sudo pip silently fails. Some background: I am launching an EMR cluster with version…

python linux pip amazon-emr

asked Jan 22 '19 at 17:13

Chris Cugliotta

1 answer

Why is pyspark sql query against S3 returning nulls

I am getting different results when running the same query in Athena against an S3 source vs. doing it from within a PySpark script on an EMR (1 x 10) cluster. I get data back from Athena, but all I get are nulls with the script. Any…

amazon-s3 pyspark null amazon-emr amazon-athena

asked Jan 19 '19 at 01:14

Thom Rogers

2 answers

Hive Timestamp erroring as Binary

I'm trying to insert into a table with a query in an EMR cluster on AWS. The table is created correctly, and a colleague can run the exact same code that I'm using and it won't fail. However, when I try to run the code, I get failures in Map1 that…

amazon-web-services types hive amazon-emr

asked Jan 17 '19 at 23:53

Fish357

1 answer

Upgrade EMR 5.19 to 5.20

As part of the EMR cluster upgrade to version 5.20.0, all the big data frameworks and AWS services in the cluster were upgraded to their latest versions. For example: Spark 2.3.2 to Spark 2.4.0, Presto 0.212 to 0.214. While testing the…

amazon-web-services amazon-emr

asked Jan 14 '19 at 11:24

asur

0 answers

No output from full outer join query in PySpark when run as an EMR step or from Zeppelin (AWS EMR), but results are fine from the PySpark shell

When I do a full outer join in PySpark, it does not give any output. from __future__ import print_function import sys import json import os from pyspark.conf import SparkConf from functools import reduce from pyspark.sql import SparkSession…

sql pyspark apache-spark-sql amazon-emr

asked Jan 12 '19 at 10:17

Radhika k

1 answer

Kill Spark Job or terminate EMR Cluster if job takes longer than expected

I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but really the cluster is stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running…

apache-spark hadoop-yarn amazon-emr

asked Jan 10 '19 at 14:55

conradlee

1 answer

Error while connecting to AWS EMR cluster from mac

I'm trying to create a 3-node AWS EMR cluster. I have also created a key to connect to the cluster from macOS with the command: ssh -i ~/Downloads/BigdataKey.pem hadoop@ec2-xx-xx-xx-xx.ap-south-1.compute.amazonaws.com But it's giving an error: 192:Downloads…

macos amazon-web-services amazon-emr

asked Dec 25 '18 at 04:04

Nagesh Singh Chauhan

1 answer

In Terraform, can I recreate an EMR cluster resource when its bootstrap action contents change?

I'm not quite sure how to solve this problem in Terraform. We have an EMR cluster, with some bootstrap actions that are specified as S3 resources. A simplified view of our Terraform config is: resource "aws_s3_bucket_object" "bootstrap_action" { …

terraform amazon-emr terraform-provider-aws

asked Dec 21 '18 at 15:13

Dave DeCaprio

0 answers

EMR Cluster utilization

I have a 20-node c4.4xlarge cluster to run a Spark job. Each node has 16 vCores, 30 GiB memory, and EBS-only storage (EBS storage: 32 GiB). Since each node has 16 vCores, I understand that the maximum number of executors is 16*20 = 320 executors. Total…

apache-spark distributed-computing amazon-emr

asked Dec 20 '18 at 23:37

Abhi
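
The arithmetic in the question above can be checked with a short sketch. It assumes, as the asker does, one executor per vCore across all 20 nodes; it ignores the master node, the driver, and any YARN or OS overhead, so the real ceiling is likely lower. The function name `cluster_totals` is hypothetical.

```python
from typing import Dict


def cluster_totals(nodes: int, vcores_per_node: int, memory_gib_per_node: float) -> Dict[str, float]:
    """Return raw cluster capacity, assuming every node contributes all of its resources."""
    total_vcores = nodes * vcores_per_node
    total_memory_gib = nodes * memory_gib_per_node
    return {
        "total_vcores": total_vcores,
        "total_memory_gib": total_memory_gib,
        # Assumption: one single-core executor per vCore, memory split evenly.
        "memory_gib_per_single_core_executor": total_memory_gib / total_vcores,
    }


if __name__ == "__main__":
    # Figures taken from the question: 20 nodes, 16 vCores and 30 GiB each.
    print(cluster_totals(nodes=20, vcores_per_node=16, memory_gib_per_node=30))
```

With these inputs the raw totals are 320 vCores and 600 GiB, which leaves under 2 GiB per executor if every vCore hosts its own executor.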

1 answer

AWS EMR Zeppelin doesn't have JDBC interpreter

I created an AWS EMR cluster with Hadoop, Spark, and Zeppelin. I am following the document https://zeppelin.apache.org/docs/0.8.0/interpreter/jdbc.html , which says Fill Interpreter name field with whatever you want to use as the alias(e.g. mysql,…

amazon-web-services amazon-emr apache-zeppelin

asked Dec 20 '18 at 10:17

Mithril
