Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3368 questions
1
vote
2 answers

Enabling Spark Web UI in AWS EMR

I am submitting a Spark job on an EMR cluster and want to see the Spark Web UI, which shows the configuration and status of the master node and worker nodes. Configuration details: Release label: emr-5.17.0 Applications:…
1
vote
1 answer

Unable to Proxy AWS EMR Jupyter-Notebook's socket through node application. Failed to load kernel

We have a Node application that we want to use to proxy a Jupyter notebook running on AWS EMR. I am able to proxy all HTTP requests from the Node application using http-proxy-middleware, but for some reason I am unable to proxy web-socket…
1
vote
1 answer

Executing Zeppelin notebooks as recurring job in Amazon EMR

I am migrating from Databricks to Amazon EMR and planning to use Zeppelin notebooks in place of Databricks notebooks. Currently, many of the Databricks notebooks are scheduled as jobs. Is there any way I can create recurring jobs or add…
1
vote
1 answer

How to configure AWS Lambda to access services on the EMR master node?

My AWS Lambda function can't access the hive server running on the master node. It times out -- the same behavior as if you try to access the node from a non white-listed IP. Obviously adding the Lambda function as a whitelisted IP is a non…
1
vote
0 answers

How to access EMR Web interfaces through SSM?

We use EMR web interfaces like Zeppelin and YARN ResourceManager. For EMR hosted in a private subnet, we need to use a bastion host in the public subnet. We also use the same bastion for SSH into the EMR master. However, now SSM offers a better way…
Noam Musk
  • 47
  • 1
  • 5
1
vote
0 answers

AWS EMR S3 Hive

I am following the instructions from the book Big Data Visualization (see https://www.amazon.com/Big-Data-Visualization-James-Miller/dp/1785281941). Basically, the steps are: a) Load a huge text file into the S3 directory /bigdatavizproject1/Input b)…
THIAM HUAT Tan
  • 71
  • 1
  • 4
  • 9
1
vote
2 answers

Option to enable glue catalog for Presto/Spark on EMR using Terraform

I wanted to know whether there is support for enabling the AWS Glue catalog for Presto/Spark when running on EMR. I could not find anything in the documentation.
Atif
  • 129
  • 1
  • 14
1
vote
1 answer

Why are we seeing parquet write errors after switching to EMRFS consistent view?

We have a large ETL process running on an EMR cluster that reads and writes a large number of Parquet files to S3 buckets. Here is the code: a = spark.read.parquet(path1) a.registerTempTable('a') b =…
1
vote
0 answers

What happens if index creation is in progress and I send a request to Elasticsearch?

I use Elasticsearch 6.4 on AWS. Library: RestHighLevelClient. Index/mapping creation may take a few seconds or minutes. The documentation says that when we send a PutMappingRequest, the AcknowledgedResponse indicates whether the operation completed before…
grep
  • 5,465
  • 12
  • 60
  • 112
1
vote
1 answer

AWS EMR: Cluster terminated by user request

I'm trying to create a new cluster, or clone an existing one, through the AWS console in the US West region. The cluster starts to be created, but after a minute (or even less) it starts to terminate itself, giving the error: Terminated by user request. I tried to…
WomenWhoCode
  • 396
  • 1
  • 6
  • 18
1
vote
0 answers

Error using pyspark: ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig

I'm trying to get a pyspark script running remotely on AWS EMR, following the instructions provided by AWS. However, when I try to submit my script, I am getting the following exception: Traceback (most recent call last): File…
aco
  • 719
  • 2
  • 9
  • 26
1
vote
1 answer

SparkContext Java Deploy Job and MapReduce from AWS EMR

I was searching the web and the Amazon documentation for general know-how on running a Spark job on an existing EMR YARN cluster on AWS, and I'm stuck on the following. I have already set up a local[*] Spark cluster to test; now I want to test it on…
user11040706
1
vote
1 answer

Why do we use the Hive service principal when using beeline to connect to Hive on a Kerberos enabled EMR cluster?

I am trying to connect to Hive using beeline on an EMR cluster (Kerberos enabled) and am wondering why I'd run a kinit (using my user account) and then the following: beeline -u "jdbc:hive2://localhost:10000/default;principal=hive/_HOST@REALM" The…
Brandon
  • 375
  • 2
  • 16
1
vote
4 answers

Tables not found in Spark SQL after migrating from EMR to AWS Glue

I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata. I create Hive external tables, and they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL like spark.sql("select *…
wrschneider
  • 17,913
  • 16
  • 96
  • 176
1
vote
0 answers

Spark on AWS EMR: Update to emr-5.20 (with Spark 2.4): Jobs take longer than before

We recently upgraded the EMR release label from emr-5.16.0 to emr-5.20.0, which uses Spark 2.4 instead of 2.3.1. At first, it was terrible: jobs started to take much longer than before. Finally, we set maximumResourcesAllocation to true (maybe it was true…
Pedro
  • 21
  • 4