Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions

2 answers

Enabling Spark Web UI in AWS EMR

I am submitting a Spark job on an EMR cluster and want to see the Spark Web UI, which shows the configuration and status of the master node and the worker nodes. Configuration details: Release label: emr-5.17.0 Applications:…

asked Mar 21 '19 at 21:53

Rawss24

1 answer

Unable to Proxy AWS EMR Jupyter-Notebook's socket through node application. Failed to load kernel

We have a Node application that we want to use to proxy a Jupyter notebook running on AWS EMR. I am able to proxy all my HTTP requests from the Node application using http-proxy-middleware. But for some reason I am unable to proxy web-socket…

node.js express websocket jupyter-notebook amazon-emr

asked Mar 13 '19 at 22:17

Ankit Basarkar

1 answer

Executing Zeppelin notebooks as recurring job in Amazon EMR

I am migrating from Databricks to Amazon EMR and planning to use Zeppelin notebooks in place of Databricks notebooks. Currently, many of the Databricks notebooks are scheduled as jobs. Is there any way I can create recurring jobs or add…

amazon-web-services apache-spark pyspark amazon-emr apache-zeppelin

asked Mar 13 '19 at 09:15

Chandan392

1 answer

How to configure AWS Lambda to access services on the EMR master node?

My AWS Lambda function can't access the hive server running on the master node. It times out -- the same behavior as if you try to access the node from a non white-listed IP. Obviously adding the Lambda function as a whitelisted IP is a non…

amazon-web-services aws-lambda amazon-emr system-administration

asked Mar 11 '19 at 18:29

Walrus the Cat


0 answers

How to access EMR Web interfaces through SSM?

We use EMR web interfaces like Zeppelin and YARN ResourceManager. For EMR hosted in a private subnet, we need to use a bastion host in the public subnet. We also use the same bastion for SSH into the EMR master. However, now SSM offers a better way…

amazon-web-services amazon-emr ssh-tunnel ssm

asked Mar 07 '19 at 18:43

Noam Musk

0 answers

AWS EMR S3 Hive

I am following the instructions from the book Big Data Visualization (see https://www.amazon.com/Big-Data-Visualization-James-Miller/dp/1785281941). Basically, the steps are: a) Load a huge text file into the S3 directory /bigdatavizproject1/Input b)…

amazon-web-services amazon-s3 hiveql amazon-emr

asked Mar 05 '19 at 03:05

THIAM HUAT Tan

2 answers

Option to enable glue catalog for Presto/Spark on EMR using Terraform

I wanted to know if there's support for enabling the AWS Glue catalog for Presto/Spark when running on EMR. I could not find anything in the documentation.

terraform amazon-emr terraform-provider-aws

asked Feb 28 '19 at 05:05

Atif
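For the Glue catalog question above, a minimal sketch of the EMR configuration classifications that, as far as I recall from the EMR Release Guide, point Spark and Presto at the Glue Data Catalog. The helper name `glue_catalog_configurations` is hypothetical, and the Presto property in particular is an assumption that should be checked against the EMR release in use. The resulting JSON is the kind of document that could be passed to the Terraform `aws_emr_cluster` resource through its `configurations_json` argument, assuming the provider version supports it.

```python
import json

# Metastore client factory that EMR documents for the Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(include_presto=True):
    """Build an EMR configurations list (hypothetical helper, not an AWS API)."""
    configs = [
        {
            # Spark reads Hive metastore settings from this classification.
            "Classification": "spark-hive-site",
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
    ]
    if include_presto:
        configs.append(
            {
                # Assumption: property name as used on EMR 5.x Presto releases.
                "Classification": "presto-connector-hive",
                "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
            }
        )
    return configs


if __name__ == "__main__":
    # Print JSON suitable for pasting into a cluster definition.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```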

1 answer

Why are we seeing parquet write errors after switching to EMRFS consistent view?

We have a large ETL process running on an EMR cluster that reads and writes a large number of Parquet files to S3 buckets. Here is the code: a = spark.read.parquet(path1) a.registerTempTable('a') b =…

apache-spark amazon-s3 pyspark amazon-emr

asked Feb 26 '19 at 14:38

James Swarowski

0 answers

What happens if index creation is in progress and I send a request to Elasticsearch?

I use Elasticsearch 6.4 on AWS. Library: RestHighLevelClient. Index/mapping creation may take a few seconds or minutes. The docs tell us that when we send a PutMappingRequest, the AcknowledgedResponse indicates whether the operation completed before…

elasticsearch elastic-stack amazon-emr amazon-elastic-beanstalk

asked Feb 20 '19 at 19:17

grep


1 answer

AWS EMR: Cluster terminated by user request

I'm trying to create a new cluster or clone an existing one through the AWS console in the US West region. The cluster starts creating, but after a minute or less it begins terminating with the error: Terminated by user request. I tried to…

amazon-web-services amazon-emr

asked Feb 19 '19 at 15:15

WomenWhoCode

0 answers

Error using pyspark: ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig

I'm trying to get a pyspark script running remotely on AWS EMR, following the instructions provided by AWS. However, when I try to submit my script, I am getting the following exception: Traceback (most recent call last): File…

apache-spark pyspark hadoop-yarn amazon-emr

asked Feb 19 '19 at 01:09

aco

1 answer

SparkContext Java Deploy Job and MapReduce from AWS EMR

Hi, I was searching the web and the Amazon documentation for general know-how on running a Spark job on an existing EMR YARN cluster on AWS. I'm stuck on the following. I have already set up a local[*] Spark cluster to test; now I want to test it on…

java amazon-web-services apache-spark amazon-emr

asked Feb 10 '19 at 12:09

user11040706

1 answer

Why do we use the Hive service principal when using beeline to connect to Hive on a Kerberos enabled EMR cluster?

I am trying to connect to Hive using beeline on an EMR cluster (Kerberos enabled) and am wondering why I'd run a kinit (using my user account) and then the following: beeline -u "jdbc:hive2://localhost:10000/default;principal=hive/_HOST@REALM" The…

hadoop hive kerberos amazon-emr beeline

asked Feb 09 '19 at 03:45

Brandon
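On the beeline question above: in a Kerberos setup, `kinit` obtains a ticket for the user, while the `principal` parameter in the JDBC URL names the HiveServer2 service the client wants to talk to, so the two are not alternatives. A minimal sketch that assembles the URL from the question; the function `hive_jdbc_url` and its parameter names are hypothetical, and `_HOST` is left as the placeholder that Hive substitutes with the server's hostname.

```python
def hive_jdbc_url(host, realm, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL for a Kerberos-enabled cluster.

    Hypothetical helper: the `principal` value identifies the Hive
    *service* principal, not the user who ran kinit.
    """
    return f"jdbc:hive2://{host}:{port}/{database};principal=hive/_HOST@{realm}"


if __name__ == "__main__":
    # Reproduces the URL quoted in the question.
    print(hive_jdbc_url("localhost", "REALM"))
```

The user's identity comes from the ticket cache populated by `kinit`; the URL only tells the client which service ticket to request.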

4 answers

Tables not found in Spark SQL after migrating from EMR to AWS Glue

I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata. I create Hive external tables, and they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL like spark.sql("select *…

apache-spark amazon-emr aws-glue

asked Feb 08 '19 at 16:28

wrschneider


0 answers

Spark on AWS EMR: Update to emr-5.20 (with Spark 2.4): Jobs take longer than before

Recently we upgraded the EMR release label from emr-5.16.0 to emr-5.20.0, which uses Spark 2.4 instead of 2.3.1. At first, it was terrible. Jobs started to take much longer than before. Finally, we set maximizeResourceAllocation to true (maybe it was true…

java apache-spark amazon-emr

asked Feb 07 '19 at 20:05

Pedro
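The setting mentioned in the question above appears to be EMR's `maximizeResourceAllocation` option, which lives in the `spark` configuration classification. A minimal sketch of that classification as it could be supplied when creating a cluster; the helper name `spark_max_allocation_config` is hypothetical, and whether enabling the option helps a given workload is not something this sketch claims.

```python
import json


def spark_max_allocation_config(enabled=True):
    """Build the EMR `spark` classification (hypothetical helper).

    EMR expects property values as strings, hence "true"/"false".
    """
    return [
        {
            "Classification": "spark",
            "Properties": {
                "maximizeResourceAllocation": "true" if enabled else "false"
            },
        }
    ]


if __name__ == "__main__":
    # Print JSON that can be used as a cluster's configurations document.
    print(json.dumps(spark_max_allocation_config(), indent=2))
```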
